Preface


The China Cyber Security Annual Conference is the annual event of the National Computer
Network Emergency Response Technical Team/Coordination Center of China (hereinafter
referred to as CNCERT/CC). Since 2004, CNCERT/CC has successfully held 18 China Cyber
Security Annual Conferences. As an important bridge for technical and service exchange on
cyber security affairs among industry, academia, and practitioners, the conference has
played an active role in safeguarding cyber security and raising social awareness.

Founded in August 2001, CNCERT/CC is a non-governmental non-profit cyber security
technical center and the key coordination team for China’s cyber security emergency
response community. As the national CERT of China, CNCERT/CC strives to improve
the nation’s cyber security posture and safeguard the security of critical information
infrastructure. CNCERT/CC leads efforts to prevent, detect, alert, coordinate, and handle
cyber security threats and incidents, in line with the guiding principle of “proactive
prevention, timely detection, prompt response, and maximized recovery”.

This year, the China Cyber Security Annual Conference was held online from August
16 to 17, 2022, on the theme of “Jointly Safeguarding Digital Information Infrastructure”
as the 19th event in the series. The conference featured one main session and six
sub-sessions. The mission was not only to provide a platform for sharing emerging
trends and concerns in cyber security and discussing countermeasures or approaches to
deal with them, but also to find ways to join hands in managing threats and challenges
to digital information infrastructure. The online event received over 5.8 million visits.
Please refer to the following URL for more information: http://conf.cert.org.cn.

We announced our call for papers on our official website, after which 64 submissions
were received by the deadline from authors with a wide range of affiliations, including
universities, research institutions, telecom operators, companies, financial institutions,
and NGOs. After receiving all submissions, we randomly assigned five papers to each
reviewer, and every paper was reviewed by three reviewers in a single-blind manner.
All submissions were assessed based on their credibility of innovation, contribution,
reference value, significance of research, language quality, and originality. We adopted
a thorough and competitive reviewing and selection process which took place in two
rounds. In the first round, we invited the reviewers to conduct an initial review. Based
on the comments received, 34 papers passed, and the authors of these 34 pre-accepted
papers made modifications accordingly. In the second round, the modified papers were
reviewed again. Finally, 17 out of the total 64 submissions stood out and were accepted.
The acceptance rate was 26.56%.

The 17 papers contained in these proceedings cover a wide range of cyber-related topics,
including network intrusion detection, cloud network, data security, cryptocurrency,
vulnerabilities, mobile Internet security, threat intelligence, and webpage tampering
detection.




We sincerely thank all the authors for their participation, and our thanks also go to
the Program Committee for their considerable efforts and dedication in helping us
solicit and select papers of quality and creativity.

Lastly, we humbly hope these proceedings of CNCERT 2022 will shed some light for
all readers in their forthcoming research and exploration of their respective fields.
An Intelligent Data Flow Security Strategy
Model of Cloud-Network Integration 


Nishui Cai, Zhuxiang Deng, and Hao Wang


Telecom Park, China Telecom Research Institute, Shanghai 201315, China 
cainishui@chinatelecom.cn 


Abstract. The security of cloud-network integration business data flow is mainly
reflected in the business deployment stage and the online service stage. First, this
paper analyzes the trend of digital platform technology for the cloud-network
integration business system and puts forward an intelligent data flow security strategy
model of cloud-network integration, which includes an expert rule judgment system for
simple cloud scenes and an AI algorithm application model for complex cloud scenes.
Then, based on the security policy model of intelligent data flow, this paper studies a
hierarchical linkage cloud-network integration security operation system and a
scenario-based risk monitoring capability system for personal privacy data protection.
Finally, this paper points out that the cloud-network integration intelligent data flow
security strategy based on AI algorithms needs to be further studied.


Keywords: Cloud-network integration · Digital operation platform ·
Hierarchical linkage · Security operation · Data classification · Intelligent data
flow · Security strategy · AI algorithm


1 Introduction 


In “cloud-network integration”, the cloud is cloud computing and the network is the
communication network: the network is the foundation, the cloud is the core, the network
moves with the cloud, and the two are integrated. Cloud-network integration is China’s
digital economy development strategy and enterprise digital transformation strategy with
Chinese characteristics [1]. Within this strategy, cloud-network integration is the
foundation, cloud-network security is the support, the digital platform is the hub, and
scientific and technological innovation is the core.

At this stage, the main problems of cloud-network operation support are that the
operation support systems are too scattered, the BMO data of cloud-network operation is
not fully connected, data has not noticeably improved cloud-network operation efficiency,
and AI is not yet widely applied to intelligent cloud-network operation [2]. The common
goal is to establish an AI-enabled digital platform that fully understands the needs of
customers, supports data-based decisions, quickly provides digital business service
capability and an efficient response operation system, and adapts to the rapid
development of industrial digitization.


© The Author(s) 2022 
The new generation cloud-network operation business system should have key technologies
such as digital twinning of cloud-network resources [3], decoupled acquisition and control
of atomic capabilities, big data and AI enabling, and cloud-network integration security
operation. These correspond to the resource center, the acquisition and control center,
the big data and AI center, and the cloud-network security operation center of the system.


• Digital twinning of cloud-network resources [4]. The resource center is responsible
for digitizing cloud-network operation elements, depicting “cloud, network, edge and
end” resource information, realizing unified integrated management of resources and
network operation data, and providing standardized end-to-end cloud-network related
resource data service capabilities and business service capabilities.

• Decoupled acquisition and control of atomic capabilities. The acquisition and control
center is an important foundation for cloud-network integrated operation [5].

• Big data and AI empowerment. The big data and AI center is responsible for making
full use of big data and AI capabilities, connecting BMO domains and enabling
scenario applications, such as the security policy model of “intelligent data flow” [6].

• Cloud-network integration security operation. The cloud-network security center is
responsible for two tasks. The first is establishing a layered, domain-based and graded
cloud-network security protection system and a hierarchical linkage cloud-network
integration security operation strategy, to meet the security needs of data flow in the
business deployment phase. The second is establishing scenario-based risk monitoring
capability for user personal information protection, to meet the security needs of user
personal information and other important data flow in the online service phase.


2 Intelligent Data Flow Security Strategy Model of Cloud-Network 
Integration 


The security policy model of intelligent data flow, as shown in Fig. 1, can call different 
data flow intelligent models according to different scenario applications, such as the 
security protection inter layer linkage strategy and control rules of “network moves with 
cloud and cloud moves with data”, and automatically divide new specific security area 
boundaries and security levels according to the security linkage between different layers. 


Fig. 1. Security policy model of intelligent data flow.


a) Intelligent data flow in simple cloud scenes


• Data to flow: data capacity, data classification, protection requirements, etc.

• Data flow analysis of simple cloud scenario: including reasoning and judgment
based on boundary constraints and expert rules, data flow strategy and
multi-scheme decision-making selection, cloud-network characteristic capacity,
protection level, unit energy consumption of equipment, etc.

• Application scenario: when the data flow changes in a single cloud or two clouds,
the rule-based intelligent model is preferred.


b) Intelligent data flow in complex cloud scenes


• Feature extraction: flow data, multi-dimensional feature parameters of cloud and
topological relationship of multi-cloud entities.

• Data flow analysis of complex cloud scene: intelligent judgment based on AI
model and selection of machine learning clustering algorithm.

• Application scenario: when the data flow changes in multiple clouds and there
are multidimensional and complex nonlinear characteristics, it is more suitable
to apply the intelligent model based on AI algorithm.


c) Database, intelligent model and self-learning 


• Database: including cloud feature database, hierarchical security component
database and data flow case database.

• Intelligent model: including the expert rule base and the AI algorithm base. The
expert rule base is divided into single feature rules and compound feature rules.
The AI algorithm base includes, for example:

  Partition clustering: K-means, K-medoids
  Hierarchical clustering: BIRCH, CURE
  Density clustering: DBSCAN
  Grid clustering: STING, CLIQUE
  Mixed clustering: Gaussian mixture model, CLIQUE

• Self-learning of the intelligent model: the output result “data flow scheme”
executed by each strategy model is saved to the data flow case base, and the
latest case base is then called regularly for AI algorithm learning and training, so
as to update the relevant model parameters of the AI algorithm in time.
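
The selection between the two scene types and the self-learning case base can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names, field names and the returned labels are assumptions introduced here, and the only rules taken from the text are that one or two clouds call the rule-based model, more clouds call the AI-based model, and every executed scheme is saved to the case base.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FlowRequest:
    """A data flow request (hypothetical structure for illustration)."""
    clouds: List[str]          # clouds involved in the data flow change
    data_gb: float             # data capacity to flow
    protection_level: int      # required protection level


@dataclass
class StrategyModel:
    """Chooses a model per request and records each executed scheme."""
    case_base: List[Tuple[FlowRequest, str, dict]] = field(default_factory=list)

    def select_model(self, request: FlowRequest) -> str:
        # Simple cloud scene: a single cloud or two clouds -> expert rules.
        if len(set(request.clouds)) <= 2:
            return "expert_rules"
        # Complex cloud scene: multiple clouds -> AI algorithm.
        return "ai_algorithm"

    def record(self, request: FlowRequest, scheme: dict) -> str:
        # Self-learning step 1: save the output "data flow scheme".
        model = self.select_model(request)
        self.case_base.append((request, model, scheme))
        return model

    def latest_cases(self, n: int) -> list:
        # Self-learning step 2: the most recent cases, to be fed to training.
        return self.case_base[-n:]
```

A periodic training job would call `latest_cases` and refit whichever clustering algorithm is in use; that part is left out because the text does not specify it.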
3 Hierarchical Linkage Cloud-Network Integration Security
Operation System Based on the Security Policy Model 
of Intelligent Data Flow 


Intelligent operation is a new digital operation capability and will become a necessary
capability for enterprise digital transformation [7]. At present, intelligent operation
needs to evolve gradually from single-scenario intelligence to global intelligent
operation.

Given the characteristic of the cloud-network integration business system that “the
network follows the cloud and the cloud follows the data”, and the requirement that its
security domain classification units be protected by layer, domain and level, and drawing
on the industry’s research experience in network and information security strategy, this
paper proposes a hierarchical linkage cloud-network integration security operation
strategy to meet the security operation requirements of the cloud-network integration
business system.


3.1 Cloud-Network Security Protection System with Layers, Regions and Levels 


According to the Chinese national standard GB/T 22239-2019, basic requirements for
network security classification protection of information security technology, the
security equipment or security components distributed in the network are classified
according to “network, cloud, application, data and terminal”, so as to realize
hierarchical decoupling, flexible arrangement and open atomic capabilities of cloud
security resources. The hierarchical security capability components of “network, cloud,
application, data, terminal” of the cloud-network integration business system are shown
in Table 1 below.
The security capability components of each layer are as follows: 


a) “Network” layer security capability component. 

b) “Cloud” layer security capability component. 

c) “Application” layer security capability component. 
d) “Data” layer security capability component. 

e) “Terminal” layer security capability component. 


3.2 Hierarchical Linkage Cloud-Network Integration Security Operation System 
Based on the Security Policy Model of Intelligent Data Flow 


The hierarchical linkage cloud-network integration security operation system based on 
the security policy model of intelligent data flow is shown in Fig. 2. 

The core modules include: cloud-network integration security policy manage- 
ment point, hierarchical and domain security policy blockchain, “hierarchical linkage” 
security policy, and hierarchical and domain security control. 


Table 1. List of hierarchical security capability components of cloud-network system.

Security capability component | Security layer | Safety level
Terminal virus defense | Terminal | Level 2
Terminal access management | Terminal | Level 3
Terminal data leakage prevention | Terminal | Level 3
Enterprise terminal leakage prevention | Terminal | Level 3+
Data identification | Data | Level 2
Document encryption | Data | Level 2
Data desensitization | Data | Level 3
Database audit | Data | Level 3
Network DLP | Data | Level 3
Data encryption | Data | Level 3
Data destruction | Data | Level 3+
Website content monitoring | Application | Level 2
Web page tamper proof (extranet) | Application | Level 2
Web page tamper proof (intranet) | Application | Level 3+
Mimicry defense | Application | Level 3+
Unified access (4A) | Application | Level 3
Mobile app shell | Application | Level 3
Code audit | Application | Level 3
Anti DDoS | Cloud | Level 2
IPS (internet outlet) | Cloud | Level 2
IPS (internet outlet) | Cloud | Level 3
WAF (internet outlet) | Cloud | Level 2
WAF (intranet outlet) | Cloud | Level 3
VPN | Cloud | Level 2
Firewall (FW) | Cloud | Level 2
Anti-virus gateway | Cloud | Level 3
Bastion host | Cloud | Level 2
Vulnerability scanning | Cloud | Level 2
Host protection | Cloud | Level 3+
Honeypot system (extranet) | Cloud | Level 3
Honeypot system (intranet) | Cloud | Level 3+
Full flow (extranet) | Cloud | Level 3
Full flow (intranet) | Cloud | Level 3+
Mobile malicious program | Network | Level 2
Network abnormal traffic monitoring | Network | Level 2
Flow direction monitoring | Network | Level 2
Online log retention | Network | Level 2
Botnet, Trojan and worm detection | Network | Level 2
Unrecorded website detection | Network | Level 2
Domain name information security management | Network | Level 2
Spam message interception | Network | Level 2
IDC/ISP | Network | Level 2
DNS | Network | Level 3
Attack traceability | Network | Level 3+


Fig. 2. Cloud-network integration security operation system with hierarchical linkage based on
the security policy model of intelligent data flow.


— Cloud-network integration security policy management point: the security administrator
configures the security policy in the cloud-network integrated security domain unit
through the security policy management point, including the setting of security
parameters, unified security marks for subjects and objects, authorization of subjects,
configuration of trusted authentication policies, etc.

— Hierarchical and domain security policy blockchain: security policies are released
promptly to the security policy execution point of each layer through the blockchain.

— “Layered linkage” security strategy: the execution point of each layer’s security
strategy is responsible for the query and linkage adjustment of this layer’s security
strategy and security control rules. The “layered linkage” security policy rule base can
call the security linkage rules between different layers according to the inter-layer
linkage policy and control rules of “the network moves with the cloud and the cloud
moves with the data”, and automatically divide the new specific security area boundary
and security level.

— Hierarchical and sub-domain security control: implement security policies and security
control rules hierarchically, carry out “network, cloud, application, data, terminal”
hierarchical and sub-domain security level protection according to the security level of
the cloud-network integrated security domain unit, and automatically control the
security equipment or security components distributed in the network.
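
The text does not describe the design of the security policy blockchain. As a rough sketch of the one property it relies on, namely that every layer's execution point can check that a released policy has not been altered, the following uses a plain hash-chained list. The function names, the block layout and the all-zero genesis hash are assumptions made for illustration only; consensus and distribution are not modelled.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first block


def make_block(prev_hash: str, policy: dict) -> dict:
    """Wrap a policy in a block whose hash covers the previous block's hash."""
    payload = json.dumps(policy, sort_keys=True)
    digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    return {"prev": prev_hash, "policy": policy, "hash": digest}


def release(chain: list, policy: dict) -> list:
    """Append a new policy release to the chain and return the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(make_block(prev_hash, policy))
    return chain


def verify_chain(chain: list) -> bool:
    """Recompute every hash; any edited policy breaks the chain."""
    prev_hash = GENESIS_HASH
    for block in chain:
        payload = json.dumps(block["policy"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        if block["prev"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True
```

An execution point that receives the chain would call `verify_chain` before applying the newest policy.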


3.3 Feasibility Verification of Cloud-Network Integration “Intelligent Data 
Flow” Security Strategy 


Verification Flow Chart

The hierarchical linkage cloud-network integration security operation flow chart is
shown in Fig. 3.


Fig. 3. Hierarchical linkage cloud-network integration security operation flow chart.
According to Fig. 3, the following simulation verifies the security strategy of
cloud-network integration “intelligent data flow”. Since there are only two clouds in the
application scenario, the simple cloud scenario expert rule model will be called in the
simulation process.


Application Case of Hierarchical Rules 

Through the security policy management point, the security administrator configures 
the security policy in the cloud-network integrated security domain unit, including the 
setting of security parameters, unified security marks (Se_Token) for subjects and objects, 
boundary range of security area (Zone_defense), authorization of subjects, configuration 
of trusted authentication strategy, etc. 


Security_Policy{Se_Token(Subjects, Objects), Zone_defense}


S0: Cloud-Network Characteristics and Initialization Security Policy Parameters of
Layered and Domain


1. S01 Existing Cloud Feature “Layered Linkage” Security Policy Rule Base


SPRB(zone0, zone1, zone2 ...)


— Cloud C1, 500 GB of available storage space, corresponding network boundaries
N1 and N2, and the initial protection level is level 2.

— Cloud C2, available storage space 2000 GB, corresponding network boundaries
N3 and N4, initial protection level 2.


2. S02 Initial Security Policy Parameters

SP0{ST(Subj0, Obj0), Zone0}


— Cloud C1, used storage space 400 GB, remaining available storage space 100 GB,
corresponding network boundaries N1 and N2, initial protection level 2.

— Cloud C2, unused, remaining available storage space 2000 GB, corresponding
network boundaries N3 and N4, initial protection level 2. (See Table 2)


Table 2. Cloud-network integration “intelligent data flow” process table (initial state).

Operation | Cloud | Maximum storage space | Used space | Free space remaining | Network boundary | Safety level
Initial state | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Initial state | C2 | 2000 GB | 0 GB | 2000 GB | N3, N4 | Level 2
S1: Cloud-Network Integration Security Policy Configuration


SP{ST(Subj, Obj), Zone}


• Operation 1: add 150 GB data storage and related application deployment, with level
3 security protection.

• Operation 2: reduce 100 GB data storage and related application deployment, with
level 3 security protection.


S2: Hierarchical and Domain-Based Security Policy Blockchain

The configured security policy is released promptly to the security policy execution
point of each layer through the blockchain.


S3: “Layered Linkage” Security Policy Model 


S31: Implementation Points of Security Policies at All Levels 

Receive the security policy issued by the security policy blockchain and call the S32
layered linkage security policy rules to calculate the minimum protection area (MinZone)
and the security protection level (MaxST), which together form the adjusted overall
requirements of cloud-network security protection. Then determine the security policy
of this layer, query the security capability components of the relevant security
protection level of this layer according to the cloud-network integration layered
protection security capability component system diagram and the security protection
level, and issue the relevant security policy adjustment instructions at all levels.

S32: “Layered Linkage” Security Policy Rule Base


SPRB(zone0, zone1, zone2 ...)


According to the security protection linkage strategy and control rules of “the
network moves with the cloud and the cloud moves with the data”, the security
linkage rules between different layers can be called to automatically divide the new
specific security area boundary and security level, as given by (1).


SP = SP + SP0    (1)
SP = {MaxST(Subj0 + Subj, Obj0 + Obj), MinZone}

a) Operation 1: Add a Protection Object

Scheme 1: C1 first and then C2. Determine the minimum protection area
(MinZone) according to the capacity of the protected object:


400 GB + 150 GB = 550 GB


This exceeds the 500 GB maximum storage space of C1, so 100 GB is placed in C1 and
the remaining 50 GB in C2, and the minimum protection area covers both clouds.

Determine the safety protection level (MaxST) according to the highest level of
the protected object: adjust the security protection level of C1, C2 and the
corresponding network boundaries to level 3.

— Cloud C1, 500 GB of used storage space, no remaining available storage space,
corresponding network boundaries N1 and N2, need to be reinforced with
protection level 3.

— Cloud C2, 50 GB used this time, 1950 GB of remaining available storage
space, corresponding network boundaries N3 and N4, need to be reinforced with
protection level 3.


Similarly, for scheme 2, C2 is used first and then C1, and so on. (See Table 3).
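
The arithmetic of Operation 1 can be sketched as below. This is a simplified illustration under stated assumptions, not the rule base itself: the function name and dictionary layout are hypothetical, the data is placed greedily in the given cloud order (MinZone is the set of clouds that receive data), and every cloud that receives data is raised to the higher of its current level and the level of the new object (MaxST).

```python
def add_protected_object(clouds: dict, order: list, size_gb: int, level: int) -> dict:
    """Place size_gb of data across clouds in the given order.

    clouds maps a cloud name to {"capacity", "used", "level"}.
    Returns a new state; the input is left unchanged.
    """
    state = {name: dict(info) for name, info in clouds.items()}
    remaining = size_gb
    for name in order:
        if remaining <= 0:
            break
        cloud = state[name]
        placed = min(cloud["capacity"] - cloud["used"], remaining)
        if placed > 0:
            cloud["used"] += placed
            # MaxST: a cloud holding the object takes the highest level involved.
            cloud["level"] = max(cloud["level"], level)
            remaining -= placed
    if remaining > 0:
        raise ValueError("not enough free space in the listed clouds")
    return state


# Initial state from Table 2.
initial = {
    "C1": {"capacity": 500, "used": 400, "level": 2},
    "C2": {"capacity": 2000, "used": 0, "level": 2},
}

scheme1 = add_protected_object(initial, ["C1", "C2"], 150, 3)  # C1 first
scheme2 = add_protected_object(initial, ["C2", "C1"], 150, 3)  # C2 first
```

Under these assumptions, `scheme1` fills C1 to 500 GB and puts 50 GB in C2 with both at level 3, while `scheme2` leaves C1 untouched at level 2 and puts all 150 GB in C2 at level 3.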


Table 3. Cloud-network integration “intelligent data flow” process table (operation 1)

Operation | Cloud | Maximum storage space | Used space | Free space remaining | Network boundary | Safety level
Initial state | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Initial state | C2 | 2000 GB | 0 GB | 2000 GB | N3, N4 | Level 2
Scheme 1 (Operation 1) | C1 | 500 GB | 500 GB | 0 GB | N1, N2 | Level 3
Scheme 1 (Operation 1) | C2 | 2000 GB | 50 GB | 1950 GB | N3, N4 | Level 3
Scheme 2 (Operation 1) | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Scheme 2 (Operation 1) | C2 | 2000 GB | 150 GB | 1850 GB | N3, N4 | Level 3


b) Operation 2: Reduce Protected Objects 
Similarly...... (See Table 4). 


S4: Hierarchical Security Control 

After each security operation policy adjustment, the hierarchical security control
points immediately receive and execute the security policy adjustment instructions of
each layer, query the corresponding security capability components in the hierarchical
security capability component list of the cloud-network integration business system
according to Table 1, and automatically control the security equipment or security
components distributed in the cloud-network integration system. That is, the network,
cloud, application, data and terminal layers each load and reinforce the safety
protection equipment and application safety components of the corresponding level.

By adding and reducing protection objects and protection requirements, the protec- 
tion strategies of "network, cloud, application, data and terminal" of the cloud system 
have been adjusted automatically and implemented through the hierarchical security 
control points. 
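
The Table 1 lookup performed by a control point can be sketched as a simple filter. Only a few rows of Table 1 are reproduced, and one assumption goes beyond the text: a unit at a given safety level is taken to load every component of its layer rated at or below that level, with Level 3+ ranked above Level 3. The names `LEVEL_RANK`, `COMPONENTS` and `components_for` are illustrative.

```python
# Ordering of the safety levels used in Table 1 (the ranking is an assumption).
LEVEL_RANK = {"Level 2": 2, "Level 3": 3, "Level 3+": 4}

# A small excerpt of Table 1: (component, security layer, safety level).
COMPONENTS = [
    ("Terminal virus defense", "Terminal", "Level 2"),
    ("Terminal access management", "Terminal", "Level 3"),
    ("Data identification", "Data", "Level 2"),
    ("Data desensitization", "Data", "Level 3"),
    ("Data destruction", "Data", "Level 3+"),
    ("Anti DDoS", "Cloud", "Level 2"),
    ("Anti-virus gateway", "Cloud", "Level 3"),
    ("DNS", "Network", "Level 3"),
]


def components_for(layer: str, level: str) -> list:
    """Return the components of one layer needed at the given safety level."""
    limit = LEVEL_RANK[level]
    return [
        name
        for name, comp_layer, comp_level in COMPONENTS
        if comp_layer == layer and LEVEL_RANK[comp_level] <= limit
    ]
```

For example, raising a cloud from level 2 to level 3 would add the anti-virus gateway to the components already loaded for the "Cloud" layer in this excerpt.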
Table 4. Cloud-network integration “intelligent data flow” process table (operation 2)

Operation | Cloud | Maximum storage space | Used space | Free space remaining | Network boundary | Safety level
Initial state | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Initial state | C2 | 2000 GB | 0 GB | 2000 GB | N3, N4 | Level 2
Scheme 1 (Operation 1) | C1 | 500 GB | 500 GB | 0 GB | N1, N2 | Level 3
Scheme 1 (Operation 1) | C2 | 2000 GB | 50 GB | 1950 GB | N3, N4 | Level 3
Scheme 2 (Operation 1) | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Scheme 2 (Operation 1) | C2 | 2000 GB | 150 GB | 1850 GB | N3, N4 | Level 3
Scheme 1 (Operation 2) | C1 | 500 GB | 450 GB | 50 GB | N1, N2 | Level 3
Scheme 1 (Operation 2) | C2 | 2000 GB | 0 GB | 2000 GB | N3, N4 | Level 2
Scheme 2 (Operation 2) | C1 | 500 GB | 400 GB | 100 GB | N1, N2 | Level 2
Scheme 2 (Operation 2) | C2 | 2000 GB | 50 GB | 1950 GB | N3, N4 | Level 3
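
The Operation 2 rows of Table 4 are consistent with one possible release rule, sketched below. The rule is an assumption inferred from the table and is not stated in the text: level 3 data is released in the reverse of the order in which it was placed, and a cloud that no longer holds any of that data falls back to the initial level 2. The function and argument names are hypothetical.

```python
def reduce_protected_object(state: dict, placed: dict, order: list,
                            size_gb: int, base_level: int = 2):
    """Release size_gb of previously placed data, last-placed cloud first.

    state maps a cloud name to {"used", "level"}; placed maps a cloud name
    to the amount of level 3 data it currently holds.
    """
    new_state = {name: dict(info) for name, info in state.items()}
    held = dict(placed)
    remaining = size_gb
    for name in reversed(order):
        freed = min(held.get(name, 0), remaining)
        held[name] = held.get(name, 0) - freed
        new_state[name]["used"] -= freed
        remaining -= freed
        if held[name] == 0:
            # Nothing of the higher-level object is left on this cloud.
            new_state[name]["level"] = base_level
    if remaining > 0:
        raise ValueError("cannot release more data than was placed")
    return new_state, held


# Scheme 1 after Operation 1: 100 GB placed in C1, 50 GB in C2.
after_op1_s1 = {"C1": {"used": 500, "level": 3}, "C2": {"used": 50, "level": 3}}
s1, _ = reduce_protected_object(after_op1_s1, {"C1": 100, "C2": 50}, ["C1", "C2"], 100)

# Scheme 2 after Operation 1: all 150 GB placed in C2.
after_op1_s2 = {"C1": {"used": 400, "level": 2}, "C2": {"used": 150, "level": 3}}
s2, _ = reduce_protected_object(after_op1_s2, {"C1": 0, "C2": 150}, ["C2", "C1"], 100)
```

With these inputs the sketch yields the same used space and safety levels as the Operation 2 rows of Table 4.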


3.3.1 Application Case of Intelligent Multi-cloud Resource Scheduling 


Feature Selection in the Sample Space of Resource Scheduling AI Algorithm in
Multi-cloud Scenarios


General principles to be followed:


• Private cloud resources are scheduled and used preferentially. Only when private
cloud resources are insufficient can they be dispatched to the industry cloud or
public cloud.

• Sort according to the billing cost of the industry cloud or public cloud, and give
priority to the low-cost industry cloud or public cloud.

• Evaluate the security capability of the public cloud according to the business or data
security level, and calculate it as a resource scheduling parameter measure. Data
security capability is an important indicator to evaluate the public cloud.

• Evaluate according to indicators such as public cloud reliability and resource
effectiveness, and calculate them as scheduling parameter measures.
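
The four principles can be read as a filter followed by a sort, as in the sketch below. The dictionary keys (`kind`, `unit_price`, `safety_level`, `reliability`) and the sample values are assumptions introduced for illustration; the text does not define how the measures are weighted, so a plain lexicographic order is used here.

```python
def rank_clouds(clouds: list, demand_level: int) -> list:
    """Order candidate clouds for scheduling.

    1. Drop clouds whose security capability is below the demanded level.
    2. Private clouds come first.
    3. Among the rest, cheaper billing comes first.
    4. Ties are broken by higher reliability.
    """
    eligible = [c for c in clouds if c["safety_level"] >= demand_level]
    return sorted(
        eligible,
        key=lambda c: (c["kind"] != "private", c["unit_price"], -c["reliability"]),
    )


# Hypothetical candidates.
candidates = [
    {"name": "public-A", "kind": "public", "unit_price": 3, "safety_level": 2, "reliability": 0.99},
    {"name": "industry-B", "kind": "industry", "unit_price": 5, "safety_level": 3, "reliability": 0.999},
    {"name": "private-C", "kind": "private", "unit_price": 10, "safety_level": 3, "reliability": 0.99},
]

order_level3 = [c["name"] for c in rank_clouds(candidates, 3)]
order_level2 = [c["name"] for c in rank_clouds(candidates, 2)]
```

A scheduler would walk this order and move to the next cloud only when the current one lacks free space, which matches the first principle.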


The above principles can be used as a basis for evaluating the importance of feature
parameters when the AI algorithm selects the feature space. In order to better reflect
the principles of resource scheduling in a multi-cloud scenario, the operation log of
each multi-cloud resource scheduling is generalized. Each scheduling operation is taken
as a feature sequence, and features with strong correlation are selected for vector
representation, which is stored in the case database as a case training set.

The characteristic variable names conforming to the above scheduling principles are
generalized to <c1_free_space>, <c2_free_space>, <c1_safety_level>, <c2_safety_level>,
<c1_unit_price> and <c2_unit_price>, which respectively represent the utilization,
security level and unit price (an economic indicator) of each cloud, and to
<demand_cloud_space> and <demand_safety_level>, which represent the resource scheduling
requirements. Because <c1_safety_level>, <c2_safety_level>, <c1_unit_price> and
<c2_unit_price> are relatively stable during resource scheduling, they are not considered
for now. Here, only the important features are considered to form the sample feature
space.

Multi-cloud Resource Scheduling Method Based on KNN Algorithm 

An advantage of the KNN algorithm is that it can handle both classification and
regression problems. It is also robust to interference and accurate. Its low efficiency
can be mitigated by controlling and updating the sample size, which suits the size of
the operation log of multi-cloud resource scheduling.

Now only the features <c1_free_space> and <c2_free_space> in the feature space are
taken. Assume that the unknown sample is serialized as follows, with k = 3:


(demand_cloud_space, c1_free_space, c2_free_space) = (100, 350, 200)


Query the training samples in Table 5 and calculate the nearest neighbor distances.
The samples with ID6, ID7 and ID8 are the k nearest neighbors; ID6 and ID7 belong to
class 2 and ID8 belongs to class 1. The unknown sample is therefore classified as
class 2, with the corresponding policy_scheme (50, 50), in which cloud1 and cloud2
each schedule 50 GB of resource space. (See Table 5).


Table 5. Training sample set

ID | Demand cloud space | Demand safety level | C1 unit price | C1 safety level | C1 free space | C2 unit price | C2 safety level | C2 free space | Policy scheme (c1, c2) | Class
1 | 100 | 2 | 10 | 3 | 200 | 5 | 2 | 700 | (100, 0) | 1
2 | 100 | 3 | 10 | 3 | 250 | 5 | 2 | 600 | (50, 50) | 2
3 | 100 | 2 | 10 | 3 | 350 | 5 | 2 | 350 | (50, 50) | 2
4 | 100 | 2 | 10 | 3 | 750 | 5 | 2 | 100 | (100, 0) | 1
5 | 100 | 3 | 10 | 3 | 300 | 5 | 2 | 800 | (0, 100) | 3
6 | 100 | 2 | 10 | 3 | 400 | 5 | 2 | 200 | (50, 50) | 2
7 | 100 | 3 | 10 | 3 | 350 | 5 | 2 | 250 | (50, 50) | 2
8 | 100 | 2 | 10 | 3 | 450 | 5 | 2 | 100 | (100, 0) | 1
9 | 100 | 2 | 10 | 3 | 500 | 5 | 2 | 850 | (0, 100) | 3
10 | 100 | 3 | 10 | 3 | 350 | 5 | 2 | 550 | (50, 50) | 2
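
The KNN classification of the unknown sample can be reproduced with a short sketch over the Table 5 data. Two choices here are assumptions, since the text does not state them: Euclidean distance is used, and ties in distance are broken by the lower sample ID. The demand_cloud_space feature is omitted because it is 100 in every sample and so contributes nothing to the distance.

```python
import math
from collections import Counter

# (ID, c1_free_space, c2_free_space, class) taken from Table 5.
TRAINING = [
    (1, 200, 700, 1), (2, 250, 600, 2), (3, 350, 350, 2), (4, 750, 100, 1),
    (5, 300, 800, 3), (6, 400, 200, 2), (7, 350, 250, 2), (8, 450, 100, 1),
    (9, 500, 850, 3), (10, 350, 550, 2),
]

# Policy scheme (c1, c2) in GB associated with each class in Table 5.
SCHEMES = {1: (100, 0), 2: (50, 50), 3: (0, 100)}


def knn_classify(query, k=3):
    """Return the IDs of the k nearest samples and the majority class."""
    ranked = sorted(TRAINING, key=lambda row: (math.dist(query, row[1:3]), row[0]))
    neighbors = ranked[:k]
    votes = Counter(row[3] for row in neighbors)
    label = votes.most_common(1)[0][0]
    return [row[0] for row in neighbors], label


neighbor_ids, predicted_class = knn_classify((350, 200), k=3)
scheme = SCHEMES[predicted_class]
```

The distances to ID6 and ID7 are both 50, the distance to ID8 is about 141.4, and the next candidate, ID3, is at 150, so the three neighbors match those named in the text.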
4 Risk Monitoring Capability System for Personal Privacy Data
Protection by Scenario System Based on the Security Policy 
Model of Intelligent Data Flow 


The effective methods for monitoring the personal information protection risk of the 
business system are as follows: 


• First of all, according to the user’s personal information protection compliance
requirements, the basic model library of user’s personal information protection risk
monitoring of the business system is established.

• Then, according to the specific situation of the business function process of the
business system, the key node view of personal information protection monitoring of each
business process of the business system is established.

• Finally, in combination with personal information protection requirements, according
to the basic model library of business system users’ personal information protection
risk monitoring, corresponding risk detection models are allocated to form a
scenario-specific business risk identification model for risk identification and analysis.


This approach not only solves the problem of visually displaying the key nodes of user
personal information protection in the business system, but also meets the accuracy
requirements of the risk monitoring model at each key node, thus improving the accuracy
and efficiency of user personal information protection risk monitoring.
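
The three steps above amount to building a model library, listing the key nodes of a business process, and allocating models to nodes. A minimal sketch follows. The category keys are taken from the five categories in Sect. 4.1 and the model names from Table 6; the key node names and the allocation are hypothetical, and categories whose models are not listed in this excerpt are left empty.

```python
# Step 1: basic model library, keyed by risk category.
BASIC_MODEL_LIBRARY = {
    "account": [
        "Password plaintext disclosure",
        "Weak password",
        "Unreasonable login authentication method",
    ],
    "exposure": [
        "Multiple types of personal information access",
        "Personal sensitive information is not desensitized",
    ],
    "authority": [],          # models not listed in this excerpt
    "transmission": [],       # models not listed in this excerpt
    "abnormal_behavior": [],  # models not listed in this excerpt
}


def build_scenario_model(key_nodes: list, allocation: dict) -> dict:
    """Step 3: attach the allocated risk models to each key node."""
    scenario = {}
    for node in key_nodes:
        categories = allocation.get(node, [])
        unknown = [c for c in categories if c not in BASIC_MODEL_LIBRARY]
        if unknown:
            raise KeyError(f"unknown risk categories: {unknown}")
        scenario[node] = [m for c in categories for m in BASIC_MODEL_LIBRARY[c]]
    return scenario


# Step 2: key node view of one (hypothetical) business process.
key_nodes = ["login", "profile_query"]
allocation = {"login": ["account"], "profile_query": ["exposure"]}

scenario_model = build_scenario_model(key_nodes, allocation)
```

Keeping the library separate from the allocation lets the same basic models be reused across business processes, which is the point of the scenario-based design described above.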


4.1 List of Basic Risk Models for Rule-Based Personal Privacy Protection 


The basic risk models for rule-based personal privacy protection are shown in Table 6. 
The basic risk models can be divided into five categories: 


a) account risk model. 

b) exposure risk model. 

c) authority risk model. 

d) transmission risk model. 

e) abnormal behavior risk model. 
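
As an illustration of how rule-based models of this kind might be evaluated, the sketch below checks one restored HTTP login request against the three account risk models of Table 6. It is not the authors' detection logic: the weak password list is a placeholder, the hex-digest test for "non plaintext" is a heuristic chosen here, and the function names are hypothetical. The account and password field names follow the parameters listed in Table 6.

```python
import hashlib

ACCOUNT_FIELDS = ("name", "username")
PASSWORD_FIELDS = ("pwd", "passwd")

# Placeholder weak password list; a real library would be much larger.
WEAK_PASSWORDS = {"123456", "password", "admin"}


def build_weak_library(passwords):
    """Weak library holding plaintext, MD5 and SHA1 forms of each password."""
    library = set(passwords)
    for p in passwords:
        library.add(hashlib.md5(p.encode("utf-8")).hexdigest())
        library.add(hashlib.sha1(p.encode("utf-8")).hexdigest())
    return library


WEAK_LIBRARY = build_weak_library(WEAK_PASSWORDS)


def looks_hashed(value: str) -> bool:
    """Heuristic: a 32, 40 or 64 character hex string is treated as a digest."""
    return len(value) in (32, 40, 64) and all(ch in "0123456789abcdef" for ch in value.lower())


def check_login_request(method: str, params: dict) -> list:
    """Return the names of the account risk models triggered by a request."""
    password = next((params[f] for f in PASSWORD_FIELDS if f in params), None)
    has_account = any(f in params for f in ACCOUNT_FIELDS)
    if password is None or not has_account:
        return []  # not a login request as far as these rules can tell
    risks = []
    if method.upper() == "GET":
        risks.append("Unreasonable login authentication method")
    if not looks_hashed(password):
        risks.append("Password plaintext disclosure")
    if password in WEAK_LIBRARY:
        risks.append("Weak password")
    return risks
```

For instance, a GET request carrying `username` and a plaintext `passwd` of `123456` would trigger all three models, while a POST request with an MD5 digest of a listed weak password would trigger only the weak password model.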
Table 6. List of risk models for rule-based personal privacy protection.

ID | Risk classification | Model name | Model function description | Model usage | Included parameters
1 | 1, Account risk identification | Password plaintext disclosure | It is found that illegal personnel may obtain the login information of site users through sniffing, listening, URL interception and other means | Restore HTTP login request; Identify account and password fields; Account and password plaintext | Request header, request body, response header, response body; Account (name, username); Password (PWD, passwd); Plaintext, non plaintext
2 | 1, Account risk identification | Weak password | Find and identify the login information of the system account and password that may be obtained by illegal personnel through dictionary guessing and Internet | Restore HTTP login request; Identification account and password fields; Match weak cipher Library | Request header, request body, response header, response body; Account (name, username); Password (PWD, passwd); Plaintext, MD5, SHA1 and other weak cipher Libraries
3 | 1, Account risk identification | Unreasonable login authentication method | It is found that illegal personnel may steal the user’s personal information without verification through the get mode | Restore HTTP login request; Extract login authentication; Judge authentication mode | Request header, request body, response header, response body; Account (name, username); Password (PWD, passwd); Submit as get; Submit in the form of post, and the cookie contains
4 | 2, Exposure risk identification | Multiple types of personal information access | Discover and identify pages that can access multiple types of personal information, resulting in increased associativity of personal information | Restore HTTP request content and response content; Personal information identification; Judge personal information category | Request content and response content; Data elements (ID card, mobile number, name, etc.); Data classification
5 | 2, Exposure risk identification | Personal sensitive information is not desensitized | Identify information that should be desensitized but not desensitized to identify sensitive personal information | Restore HTTP request content and response content; Personal information identification; Desensitization message plaintext | Request content and response content; Data elements (ID card, mobile number, name, etc.); Plaintext, non plaintext
6 | 2, Exposure risk identification | Inconsistent desensitization of personal sensitive information types | It is found that data desensitization is not carried out according to relevant regulations during identification transmission | Restore HTTP request content and response content; Built-in standard personal sensitive information desensitization rules; Desensitization rules for non-standard personal sensitive information | Request content and response content; Name (Zhang * *, Zhang *, etc., only one digit is displayed); ID number (513425*****4325); Name (Zhang * CAI); ID number (51***1990**4990, etc.)
7 | 2, Exposure risk identification | File transfer involving personal information | Discover and identify whether personal information is involved in file content during file transmission | Restore HTTP file transfer request; File personal information identification | File format (PDF, pptx, xlsx, xls, docx, Doc, RTF, XT, GZ, 7z, RAR); Number of decompression layers (customized); Data element rules
8 | 2, Exposure risk identification | Mass file transfer | Discover that a large amount of unrecognized personal information is transmitted by means of encryption and compression | Protocol resolution restore access log; Statistics traffic size of file transfer; Traffic file threshold | Request header, request body, response header, response body; Source IP; Content size (custom m)
9 | 2, Exposure risk identification | Mail transfer involving personal information | Discover and identify that internal personnel may transmit personal information through e-mail | Restore the message transfer log; Identify personal information in the message body or attachment; Count the number of personal information data | Sender, recipient, title, body and attachment can be downloaded; The system has built-in identification rules and supports new addition; Source IP
10 | 2, Exposure risk identification | Too many pieces of data returned at a time | There may be too many pieces of data returned, which may cause the operator to obtain personal information unrelated to the business | Restore the batch access behavior log; Extract the number of personal information for this visit; Match set threshold | Request content and response content; Data element rules; Number of entries (customized)
11 | 2, Exposure risk identification | Single return content size is too large | Operators’ single access to data that exceeds business requirements leads to a large number of personal information disclosure | The return content of the URL access request; Count the size of returned content per access; Match set threshold | Request content and response content; Source IP; M (custom)
12 | 2, Exposure risk identification | Batch access to files involving personal information | Discover and identify files containing personal information for batch transfer | Restore HTTP file transfer requests; Extract personal information from file transfer; Threshold value of the number of pieces of personal information contained in the file | File format (PDF, pptx, xlsx, xls, docx, Doc, RTF, XT, GZ, 7z, RAR); Number of decompression layers (customized); Number (customized)
13 | 2, Exposure risk identification | Personal sensitive information leak desensitization | The URL carries the account parameters. The operator can access the user’s personal information beyond his authority by changing the account | Restore the HTTP request content and response content; Personal information identification; Matched leak desensitization | Request content and response content; Data element rules; Desensitized (yes, no); Not desensitized (yes, no)
14 | 3, Authority risk identification | Horizontal ultra vires | The SQL statement can be traversed and executed, and the operator can access the user’s personal information beyond his authority by changing the account | The URL carries the account parameters; Replay HTTP request | Account (name, username); Replay (success, failure)
15 | 3, Authority risk identification | SQL statement traversable execution | The SQL statement can be traversed and executed, and the operator can access the user’s personal information beyond his authority by changing the account | Log restore; Resend SQL executable | HTTP (request content, response content); Data element rules; Replay (success, failure)
16 | 3, Authority risk identification | Interface not recorded | There may be the act of privately developing interfaces to obtain personal information, resulting in the disclosure of personal information | Automatically discover interface assets; Matching interface filing list | IP model; Interface filing list
17 | 3, Authority risk identification | Interface unauthorized discovery | The interface for obtaining personal information may be invoked without authorization, resulting in personal information disclosure | Automatically discover interfaces; Record interconnection access relationship; Matching interface authorization information | IP model; Whether the source IP is the host IP; Interface authorization list
18 | 3, Authority risk identification | Expired but not offline interface access | Expired interfaces are secretly called by illegal personnel to obtain personal information, which leads to personal information disclosure | Automatically discover interface interconnection access relationships; Record the last access time of the interface; Matching interface offline time | Whether the source IP is the host IP; Time point; Last access time of export interface
19 | 3, Authority risk identification | Invalid account access | When an employee transfers or leaves the company, he/she will immediately withdraw all his/her job numbers, access rights and other permissions, and delete relevant passwords, which may lead to illegal personnel using invalid account numbers to obtain personal information, avoid inspection, and lead to personal information disclosure | Account withdrawal; Last access time of account; Matching account life cycle | Account (name, username); Password (PWD, passwd); Time point; Export account list
20 | 3, Authority risk identification | Account is not authorized to access personal information | Strictly control the high-risk permissions. Do not query the user’s personal sensitive information without the user’s authorization. Unauthorized access to personal information by the account may cause personal information disclosure | Account withdrawal; Count the URL fingerprints accessed by this account; Count the types of personal information accessed by this account; Matching account authorization information | Account (name, username); Password (PWD, passwd); URL list; Data identification rules; Export sensitive account list
21 | 4, Transmission risk identification | IP accessing private network from public network | External personnel invade the intranet system through vulnerabilities, which may lead to personal information leakage or intranet security risks | Identify the server IP; Client IP ownership; Judge IP ownership | IP home; Client IP (public network); Host IP (intranet)
22 | 5, Abnormal behavior risk identification | Account remote login | The data interface is limited by IP address authentication and illegal login times. The account may be leaked or overstepped, resulting in personal information leakage and increasing the difficulty of tracing | Account withdrawal; Client IP address home; Illegal zone login | Account (name, username); Password (PWD, passwd); IP address home list; Account IP address pool
23 | 5, Abnormal behavior risk identification | IP multiplexing | The data interface is limited by IP address authentication and illegal login times. The account may be leaked or overstepped, resulting in personal information | Account withdrawal; Number of login accounts of the same IP; Set threshold | Account (name, username); Password (PWD, passwd); Time range (custom minutes); Number (customized)
24 | 5, Abnormal behavior risk identification | Account reuse | There are multiple accounts operating on the same IP, which may have been leaked or used horizontally beyond their authority | Account withdrawal; Number of login IPs per account; Set threshold | Account (name, username); Password (PWD, passwd); Time range (custom minutes); Number (customized)
25 | 5, Abnormal behavior risk identification | Account database deletion | The use of accounts by multiple people may lead to uncontrollable key operational risks of the business system, resulting in cross permissions, increasing the difficulty of accountability and other risks | Restore database operation command; Identify database delete operation instructions | Database protocol restore; Delete, drop, etc.
26 | 5, Abnormal behavior risk identification | Key instruction operation of account database | A role authority administrator should be set to uniformly manage the addition, deletion and modification of role authority, and identify the possible tampering, overwriting, deletion and addition of key personal information in the database, resulting in incomplete or lost personal information | Restore database operation command; Identify key database operation instructions | Database protocol restore; Create, update, etc.
27 | 5, Abnormal behavior risk identification | IP short-time access frequency exception | Identify the access of the same IP in a short period of time. High frequency access to personal information may cause mass disclosure of personal information | Number of visits; Match set threshold | Client IP; Time (custom minutes); URL (number of visits)
28 | 5, Abnormal behavior risk identification | Account short-time access frequency is abnormal | Identify the abnormal behavior of the account to access sensitive personal data for many times in a short time, and the high-frequency access of personal information may lead to the mass disclosure of personal information | Account withdrawal; Number of visits; Match set threshold | Account (name, username); Password (PWD, passwd); Client IP; Time (custom minutes); URL (number of visits)
29 | 5, Abnormal behavior risk identification | IP machine crawler behavior | Identify that a large amount of personal information is obtained through machine behavior and abnormal time nodes. Personal information is crawled in batches by crawlers, resulting in personal information leakage | Cluster analysis; Weight proportion; Request characteristics | Proportion of get requests and post requests; Exception time; Abnormal frequency; Abnormal access traffic size; Reference is null, user agent is not standard
30 | 5, Abnormal behavior risk identification | Too many IP (short time / single day) access data | We should focus on auditing the batch operation of key data, and identify the behaviors of illegal access to too much personal information within a limited time | Number of accesses; Set threshold | Time (custom minutes) and current day; Number of entries (customized)
31 | 5, Abnormal behavior risk identification | IP (short time / single day) access data size is abnormal | The batch operation of key data shall be audited to identify the abnormal behavior that the amount of access data exceeds the daily access value in a short time | Traffic size; Set threshold | Time (custom minutes) and current day; Number of entries (customized)
32 | 5, Abnormal behavior risk identification | Abnormal number of IP (short time / single day) downloaded files | Identify the abnormal behavior of downloading a large amount of personal information within a limited time of an IP, which is easy to cause a large amount of personal data leakage | Access time period; Set threshold | Time (custom minutes) and current day; Number (customized)
33 | 5, Abnormal behavior risk identification | Abnormal number of IP data acquired during non working hours | Identify and find that there are abnormal times of accessing services in the network during non working hours, accessing a large amount of data, and abnormal behaviors of data leakage | Access time period; Set threshold | Division of non working hours; Number of entries (customized)
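
Several of the threshold-type models in Table 6 (for example model 27, IP short-time access frequency exception) amount to counting events per key inside a time window and matching the count against a set threshold. The following Python sketch illustrates that reading only. The class name `FrequencyThresholdModel`, its parameters and the sample IP address are hypothetical and are not taken from the system described in this paper.

```python
from collections import defaultdict, deque


class FrequencyThresholdModel:
    """Flags a client IP whose visit count inside a time window exceeds a threshold."""

    def __init__(self, window_seconds: float, max_visits: int):
        self.window_seconds = window_seconds  # "Time (custom minutes)" parameter, in seconds
        self.max_visits = max_visits          # "Match set threshold" parameter
        self._visits = defaultdict(deque)     # client IP -> timestamps still inside the window

    def observe(self, client_ip: str, timestamp: float) -> bool:
        """Record one visit and return True when the IP exceeds the threshold."""
        window = self._visits[client_ip]
        window.append(timestamp)
        # Discard visits that have fallen out of the time window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_visits


# Hypothetical traffic: four visits within 30 s, then one isolated visit.
model = FrequencyThresholdModel(window_seconds=60, max_visits=3)
alerts = [model.observe("10.0.0.5", t) for t in (0, 10, 20, 30, 100)]
```

With these assumed settings only the fourth visit raises an alert. The account-level variant (model 28) would, under the same reading, key the window on the account name instead of the client IP.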


4.2 Risk Identification of Personal Privacy Data Protection in Complex Scenarios 


Risk Identification of Personal Privacy Data Protection in Complex Scenarios Based 
on Rules 

Batch information export is an important and complex scenario for personal privacy data protection. Here it is divided into two stages, authentication and authorization followed by information export, as shown in Fig. 4.


Batch Information Export Scenario Risk Monitoring Process 


Step 1: establish the management requirements feature matrix of the batch information export scenario according to the management requirements.

Step 2: data identification and analysis, that is, access the monitored business system scenarios, mirror the business system scenarios and the user access traffic data, and carry out multi-log, multi-dimensional data modeling related to batch export, including feature rule modeling on approval, bank mode, permission range and other data.

Step 3: risk identification based on rule model, extract and identify models according to key data requirements through protocol analysis and request data analysis, including data type, regular expression, eigenvalue matching, rule matching, behavior matching, etc.

[Figure: Risk identification model for batch information export scenarios; stage label: Certification authorization]

Fig. 4. Batch information export.

Step 4: scenario-based risk identification with AI models. The features of the various scenarios are compared and analyzed using AI and big data technology, and risk scenarios are identified with the AI model feature matching algorithm for batch export. Using UEBA user behavior analysis technology and the behavior baseline obtained from big data statistical analysis, the model judges whether an access is abnormal batch information export behavior and identifies the corresponding risks.

Step 5: analyze the authentication model and the approval information of the scenario, and judge the compliance of the scenario behavior by comparing and identifying the access behavior and risk of batch-exported user information.

Step 6: optimize the AI algorithm model to realize self-learning. Based on AI technologies such as machine learning and NLP, the AI algorithm model, strategy and feature base for batch information export are iteratively optimized.
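
The rule matching of Step 3 and the baseline comparison of Step 4 can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the model used in this paper: the two regular expressions are simplified, hypothetical data element rules, the function names are invented for the example, and the baseline test is a plain z-score over an assumed history of daily export counts.

```python
import re
from statistics import mean, pstdev

# Hypothetical, simplified data element rules (assumption, not the production rule base).
DATA_ELEMENT_RULES = {
    "mobile_number": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


def count_data_elements(response_body: str) -> dict:
    """Step 3 (rule matching): count personal data elements found by regular expressions."""
    return {name: len(rule.findall(response_body))
            for name, rule in DATA_ELEMENT_RULES.items()}


def exceeds_rule_threshold(counts: dict, max_entries: int) -> bool:
    """Step 3 (rule matching): compare the number of exported entries with a set threshold."""
    return sum(counts.values()) > max_entries


def deviates_from_baseline(history: list, current: int, z_limit: float = 3.0) -> bool:
    """Step 4 (behavior baseline): flag a count far above the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_limit


# Hypothetical response body of a batch export request.
body = "name=Zhang,mobile=13800138000;name=Li,mobile=13912345678,id=110101199003070012"
counts = count_data_elements(body)
rule_hit = exceeds_rule_threshold(counts, max_entries=2)

# Hypothetical history of daily export counts for one account.
history = [10, 12, 9, 11, 10]
abnormal_today = deviates_from_baseline(history, current=50)
normal_today = deviates_from_baseline(history, current=11)
```

In this sketch the rule layer and the baseline layer are kept separate, so either one can be replaced, for instance by a learned model, without changing the other.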


5 Conclusion 


The security domain units of a cloud-network integration business system are characterized by layered, sub-domain hierarchical protection across “network, cloud, application, data, terminal” and by the principle that “network moves with cloud and cloud moves with data”. This paper puts forward an intelligent data flow security strategy model for cloud-network integration, comprising an expert rule judgment system for simple cloud scenes and an AI algorithm application model for complex cloud scenes. The model can be applied to a hierarchically linked cloud-network integration security operation system and to a scenario-based risk monitoring capability system for personal privacy data protection. With the acceleration of enterprise digital transformation and the massive growth of cloud-network integration services, the AI algorithm application model for complex cloud scenes is an important topic for in-depth research in the next stage of intelligent security operation of cloud-network integration.
Abstract. With the rapid growth of big data technology and the continuous development of information technology in recent years, the significance of network security monitoring keeps increasing. To secure the system environment, organizations use various monitoring devices to govern the use of networks, hardware and applications. Meanwhile, these devices constantly produce massive and redundant data, which poses a serious problem for analysts and scientists who want to extract useful information from them, and even impairs the accuracy and efficiency of the monitoring systems. In this paper, we employ the random forest algorithm and propose an ensemble learning model for scenarios with fixed data features. We use a preprocessing method to balance positive and negative samples, and then use 6 different intrusion detection systems as weak classifiers, which satisfy the rules of “partial sampling” and “partial feature selection” of ensemble learning. Finally, we test three combination strategies, namely relative majority voting, weighted voting and stacking, to combine the predictions. Experiments show that stacking performs better than the other two, with a recall of 98.25% and a precision of 47.91%.


Keywords: Random Forest · Network Security · Monitoring and Analysis · Ensemble Learning · Imbalanced Classification


1 Introduction 


In recent years, network security monitoring has developed rapidly and played a significant role in network security. Network security monitoring is the prerequisite of a protected and functional network system. In the context of big data, network monitoring data are produced and altered endlessly. Network monitoring systems not only need to recognize traditional risks such as spiders, port scanning, webshells, injection attacks, advanced persistent threats, and phishing mail, but also have to discover emerging risks such as privacy disclosure, information leakage, and data theft. To solve these problems, it is necessary to integrate the strengths of multiple security systems and platforms, including internet probes, situation awareness systems, internet management systems, terminal detection systems, database protection systems, and so forth. However, these systems and platforms are mostly self-contained, with fuzzy boundaries and duplicated functions. If their advantages can be combined and their weaknesses compensated in one mechanism, unneeded human labor will be reduced and overall efficiency will increase.

Ensemble learning is an ideal method for solving this problem. Each intrusion detection system can be treated as a weak classifier that distinguishes normal data from intrusive data. Integrating several weak classifiers yields a strong classifier with more precise results and higher effectiveness.

In 2001, Giacinto et al. started to solve intrusion detection problems using ensemble learning methods [1]. In 2008, Giacinto et al. proposed an ensemble learning method that could detect unknown types of intrusion [2]. The random forest algorithm has been widely used and proven effective in intrusion detection ensembles. For webshell detection, a common process is to extract syntax features from PHP code through text analysis and then build the detection model [3, 4]. Because a webshell has both behavioral features and static text features, a stronger feature combination can be built by merging the two [5, 6]. Another method combines random forest with deep learning to build a network intrusion detection model through deep random forest, which can handle larger and more complex datasets [7].

Researchers in intrusion detection often choose public datasets, such as NSL-KDD, ISC2012, ADFA13, and DARPA98, or public repositories such as GitHub. Most public datasets are cleaned and balanced, with a suitable ratio of normal to intrusive data, which makes them convenient for algorithm research. However, these datasets are outdated and cannot reflect the newest trends in intrusion detection. For specific scenarios, it is necessary to collect specific data and construct a specialized dataset [8].

In this paper, we use a dataset obtained by recall sampling after desensitization of real data, which is deeply imbalanced. To adjust the ratio of different samples in an imbalanced dataset, the primary machine learning solutions are undersampling and oversampling. In addition, by adjusting model quality metrics for different categories, we can mitigate model failure caused by data imbalance [9, 10].


2 Network Security Monitoring and Random Forest 


2.1 Network Security Monitoring 


Network security monitoring is a technology that enhances the response to network intrusions by collecting and analyzing attack alarms. To conduct network traffic analysis, people generally export a replica of the network flow via a private network switch and execute the analytical procedures on a dedicated server. By using data presentation tools, data transmission tools, and data collection tools to analyze network traffic, flow information such as sessions, transactions, statistics, metadata, and alert data can be extracted. By analyzing the various types of monitoring data, digital threats and intruders can be controlled to ensure network security.
Fig. 1. Network security monitoring process diagram


Figure 1 shows the process diagram of network security monitoring. Network monitoring analysis usually relies on the analytical skills of the monitoring staff. Monitoring staff are in charge of extracting information from thousands of alarm records, analyzing and determining the misreporting rate, threat level and hazard level of each alarm, and implementing appropriate responses. In addition to analytical skill, monitoring staff also need a thorough understanding of the network environment in their specific field, including but not limited to business data patterns and asset locations. They must identify and respond to intrusions in a timely manner among numerous alarm data in complicated environments, and keep tracking subsequent events and potential risks.

Time is the most important factor in safeguarding the network system. On the one hand, misreporting can lead to serious failures because the monitoring staff are unable to deal with intrusions in time. On the other hand, underreporting can cause riskier situations that are hard to predict. Therefore, with the development of network security in recent years, monitoring systems have become more and more comprehensive. With the arrival of the big data era and the improvement of computing power, the quantity and repetitiveness of security data have increased tremendously. New security risks, especially data security risks, have arisen. These factors challenge real-time monitoring.

Because each intrusion detection system has its own technical advantages, combining these systems is fairly complex. In practice, administrators must patrol all intrusion detection systems at the same time during a monitoring process. In addition to intrusion detection tools, monitoring staff should be able to operate other systems flexibly, including asset mapping systems, log audit systems, host scanning systems, security disposal tools, security filing platforms, external intelligence platforms, tracking and recording platforms, and so on. The complex environment and complicated functions also challenge real-time monitoring.

From the perspective of the alarm data itself, in practical monitoring exercises most alarm information is misreported. Among the alarm data, most intrusions are crawlers, port scanning or vulnerability detection, which are highly repetitive and pose a low threat. Truly threatening intrusions are difficult to discover at first because they are hidden in a large amount of worthless data.

To face these challenges, a common solution is to build network security policies. However, policies are static and fixed; they can be slow to adapt to changes in the network environment and cannot simply be applied to all services and systems.

Based on the above conditions, we propose a new solution that reduces the volume of data and increases the efficiency of security system devices by further screening and classifying the alarm data.


2.2 Random Forest 


Machine learning (ML) has made great achievements in automated classification tasks in recent years, and one of the popular fields in ML is ensemble learning. By training multiple weak classifiers and combining them into a strong classifier, ensemble learning solves a classification problem jointly. Generally speaking, the classifier generated by ensemble learning is more precise than any of the weak classifiers. Sampling methods such as boosting and bagging are commonly used in ensemble learning. As combination strategies, besides voting methods such as the average method and the relative majority voting method, the stacking method is also used, which combines the models by constructing an additional learner on their outputs. Random forest is an important method widely used in ensemble learning.
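
As a small illustration of the voting strategies mentioned above, the following Python sketch combines the labels produced by several weak classifiers. The function names, the example predictions and the weights are hypothetical; stacking is not shown because it requires training an additional learner.

```python
from collections import Counter


def plurality_vote(predictions):
    """Relative majority voting: return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Weighted voting: return the label with the largest total weight."""
    scores = {}
    for label, weight in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)


# Hypothetical outputs of three weak classifiers for one record (1 = intrusion, 0 = normal).
predictions = [1, 0, 0]
majority_label = plurality_vote(predictions)
# Assumed weights, e.g. derived from each classifier's past accuracy.
weighted_label = weighted_vote(predictions, weights=[0.7, 0.2, 0.2])
```

The two strategies can disagree: here the unweighted vote follows the two classifiers that report normal traffic, while the weighted vote follows the single classifier that carries the largest weight.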

Random forest is an ensemble classifier that extends bagging and consists of many decision trees. The predictive output of the classifier is obtained by combining the outputs of the individual decision trees. On top of bagging, random forest introduces random feature selection. In other words, a feature subset is selected at random before classification, and the classification task is then conducted on that subset.
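
The two sources of randomness described above, sampling with replacement (bagging) and random feature selection, can be sketched in a few lines of Python. This is an illustrative sketch only: the feature names and records are hypothetical, and the training of the decision trees themselves is omitted.

```python
import random


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) records with replacement from the training set."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, k, rng):
    """Random feature selection: pick k distinct features for one tree."""
    return rng.sample(feature_names, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical alarm records and feature names.
features = ["src_ip", "dst_port", "payload_size", "alarm_type"]
rows = [
    {"src_ip": "10.0.0.1", "dst_port": 80, "payload_size": 512, "alarm_type": "scan"},
    {"src_ip": "10.0.0.2", "dst_port": 443, "payload_size": 128, "alarm_type": "crawler"},
    {"src_ip": "10.0.0.3", "dst_port": 22, "payload_size": 64, "alarm_type": "bruteforce"},
]

# Each tree in the forest would be trained on its own sample and feature subset.
sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, k=2, rng=rng)
tree_input = [{name: row[name] for name in subset} for row in sample]
```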

From the perspective of machine learning, we can treat intrusion detection as a classification task. Intrusion detection systems transform raw data into structured data tables using data representation tools, and then classify the data according to attack features and attack types after analyzing them. Each intrusion detection system can be treated as a weak classifier. Because the classification result of each weak classifier is inaccurate, we can generate an integrated classifier using ensemble learning to improve the precision.

Through further study of the data features of intrusion detection in monitoring analysis, we found that each intrusion detection system can identify only part of the attack features, because different intrusion detection systems come from different manufacturers and target different application scenarios. In addition, the network traffic capture method can be considered a special form of bagging, because each intrusion detection system is deployed at a different position in the network and captures incomplete and overlapping network traffic data. Therefore, when we use an individual intrusion detection system as a classifier, it naturally satisfies the two properties of “partial sampling” and “partial feature selection”. Following the idea of random forest, we use an ensemble learning method to conduct network monitoring analysis across multiple intrusion detection systems.

Compared with other types of analysis in network security, monitoring analysis has higher requirements on timeliness. Analysis with large-scale neural networks requires expensive equipment to ensure the efficiency of the computing process. Some algorithms, such as K-NN and SVM, are applicable only to the analysis of small-scale datasets. With random forest, the computation cost is at a level equivalent to that of the IDS. Considering timeliness, cost, efficiency, and dataset scale, we regard the random forest method as the most suitable choice in practice for large-scale network monitoring analysis.


2.3 Imbalanced Learning and Cost-Sensitive Learning 


From the perspective of machine learning, monitoring data are typically class-imbalanced and cost-sensitive. In network traffic, the vast majority of traffic comes from normal network services, and only a small part comes from intrusions.

In this paper, alarm data is regarded as positive class, normal traffic is regarded as 
negative class. Without any data processing, after sampling the network traffic, we found 
that the ratio of alarm data to service data reached a level of 1:10° at most. There is a 
serious imbalance between positive and negative data, which naturally biases the
classification algorithm towards the negative class.

Monitoring data is important for network security, and the consequences of
misclassifying it are asymmetric: misreporting (a false positive) may not lead to
direct consequences, but underreporting (a false negative) may leave security
vulnerabilities undetected in actual monitoring. In terms of the confusion matrix in
Table 1, the impact of underreporting is far greater than that of misreporting.


Table 1. Classification result confusion matrix


Real category | Predicted 1         | Predicted 0
1             | True Positive (TP)  | False Negative (FN)
0             | False Positive (FP) | True Negative (TN)


Class imbalance can be addressed with data-level methods and algorithm-level methods.
The data-level methods mainly include oversampling, undersampling and composite
sampling. The disadvantage of undersampling is that it may lose information, while the
disadvantage of oversampling is that it may cause overfitting. The algorithm-level
methods mainly modify an existing algorithm so that it pays more attention to the
minority class.

In this paper, we mainly use undersampling for the monitoring datasets, in two ways.
First, we limit the recall channel, which increases the proportion of positive samples
while keeping the sampling as comprehensive as possible. Second, based on the monitoring
data itself, we screen the data by filtering white list data, removing data that
contains important business features, and removing normal network traffic with the help
of an external intelligence base. After undersampling, the ratio of the positive class
to the negative class is within 1:100.

In order to account for the higher cost of underreporting, we increased the weight of
underreporting in the learning process, so that it is penalized more heavily.
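
The two measures described in this section, reducing the negative class and penalizing
underreports more heavily, can be illustrated with a minimal sketch. Random
undersampling is used here only as a stand-in for the rule-based screening described
above (white list, business features, external intelligence base). The function names,
the toy data and the penalty factor of 5 are illustrative assumptions; only the
"within 1:100" target comes from the text.

```python
import random

def undersample(samples, max_neg_per_pos=100, seed=0):
    """Keep every positive; keep at most max_neg_per_pos negatives per positive.

    samples: list of (features, label) pairs, label 1 = alarm, 0 = normal traffic.
    Random selection is an assumption standing in for rule-based screening.
    """
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    limit = max_neg_per_pos * len(positives)
    if len(negatives) > limit:
        negatives = random.Random(seed).sample(negatives, limit)
    return positives + negatives

def weighted_error(y_true, y_pred, fn_penalty=5.0):
    """Cost-sensitive error: an underreport (FN) costs fn_penalty, a misreport (FP) costs 1."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += fn_penalty  # missed alarm (underreport)
        elif t == 0 and p == 1:
            cost += 1.0         # false alarm (misreport)
    return cost / len(y_true)

# Hypothetical toy data: 2 alarms among 1,000 flows.
data = [((i,), 1 if i < 2 else 0) for i in range(1000)]
balanced = undersample(data, max_neg_per_pos=100)
print(len(balanced))                               # 2 positives + 200 negatives
print(weighted_error([1, 1, 0, 0], [0, 1, 1, 0]))  # one FN and one FP over 4 samples
```

With the penalty above 1, a model that minimizes this error prefers raising a false
alarm over missing a real one, which is the trade-off the text argues for.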


3 Application of Random Forest Algorithm 


3.1 Experiment Description 


In this paper, the number of service data samples (TN) far exceeds that of the other
classes. To avoid distortion by the service data, we use precision, recall and F-score
to evaluate classifier performance. Precision is defined as $P = \frac{TP}{TP+FP}$ and
recall is defined as $R = \frac{TP}{TP+FN}$.

In Table 1, FN is the number of underreports and FP is the number of misreports.
Considering the importance of underreporting, we increase the weight of FN and define
the F-score as $F_\alpha = \frac{(\alpha^2+1)PR}{\alpha^2 P + R}$, in which $\alpha = 2$.
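
As a minimal sketch of these metrics, the following assumes the standard weighted
F-measure $F_\alpha = (\alpha^2+1)PR / (\alpha^2 P + R)$, in which $\alpha > 1$ weights
recall, and therefore FN, more heavily. The counts in the example are hypothetical and
are not taken from the experiment.

```python
def precision(tp, fp):
    """Share of raised alarms that are real intrusions."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of real intrusions that raised an alarm."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f_score(p, r, alpha=2.0):
    """Weighted F-measure; alpha = 1 gives the harmonic mean of p and r."""
    denom = alpha ** 2 * p + r
    return (alpha ** 2 + 1) * p * r / denom if denom else 0.0

# Hypothetical confusion-matrix counts.
tp, fp, fn = 90, 60, 10
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 4), round(r, 4), round(f_score(p, r), 4))
```

Because $\alpha = 2$ rewards recall, the same precision and recall give a higher score
here than under the balanced $F_1$ whenever recall exceeds precision.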

In this paper, we use three different combination strategies to combine classifiers, 
including relative majority voting, weighted voting and stacking. 

For the weak classifiers $h_1, h_2, \ldots, h_6$ and the set of category tags
$\{c_1, c_2, \ldots, c_6\}$, we express the prediction output of $h_i$ on a sample $x$ as
the 6-dimensional vector $(h_i^1(x), h_i^2(x), \ldots, h_i^6(x))$, where $h_i^j(x)$ is the
output of $h_i$ on category tag $c_j$.

Relative majority voting:

$$H(x) = c_{\arg\max_j \sum_{i=1}^{6} h_i^j(x)} \tag{1}$$

Weighted voting ($w_i$ is the weight of $h_i$):

$$H(x) = c_{\arg\max_j \sum_{i=1}^{6} w_i h_i^j(x)} \tag{2}$$

Stacking: the predictions of the initial learners on the original dataset form a new
dataset, called the secondary training set, on which a secondary learner is then
trained using cross-validation.
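
The two voting rules can be sketched as follows. This is an illustration only: it uses
two labels instead of six to keep the example short, one-hot outputs, and the per-device
precisions as weights, all of which are assumptions rather than the settings used in the
experiment. Stacking is not covered by this sketch.

```python
def plurality_vote(outputs):
    """Relative majority voting: outputs[i][j] is h_i^j(x); returns the winning label index j."""
    n_labels = len(outputs[0])
    totals = [sum(h[j] for h in outputs) for j in range(n_labels)]
    return max(range(n_labels), key=totals.__getitem__)

def weighted_vote(outputs, weights):
    """Weighted voting: each classifier's output is scaled by its weight w_i before summing."""
    n_labels = len(outputs[0])
    totals = [sum(w * h[j] for w, h in zip(weights, outputs)) for j in range(n_labels)]
    return max(range(n_labels), key=totals.__getitem__)

# Hypothetical sample: label 0 = intrusion, label 1 = normal.
# Classifiers 1 and 2 report an intrusion, classifiers 3 to 6 report normal traffic.
outputs = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 1]]
weights = [0.63, 0.5051, 0.1837, 0.1727, 0.2108, 0.2073]  # assumed: per-device precision

print(plurality_vote(outputs))          # the four "normal" votes win
print(weighted_vote(outputs, weights))  # the two higher-weight devices win
```

The example shows how the two rules can disagree: an unweighted majority can outvote a
small number of more reliable devices, while weighting lets those devices prevail.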


3.2 Data Sampling and Preprocessing


We select part of the network traffic through the recall channel for analysis. To keep
the data comprehensive, the dataset is sampled from a complete time period. Table 2
shows the basic features of the dataset:


Table 2. Features of sampling data


Number of samples | Number of positive | Number of negative | Features
28,445,724        | 8,558              | 28,437,166         | 30


The dataset in Table 2 is sampled in proportion from the full period. Within a complete
period of one week, we observed that during working hours the network traffic is large
and consists mainly of internal business data, while at night and on holidays the
traffic volume is relatively small and consists mainly of external network access data.
After the dataset is formed, 30 data features are extracted from it in combination with
each intrusion detection device.

Before preprocessing, the ratio of the positive class to the negative class is 1:3322.
Classifying directly on the dataset of Table 2 would bias the model towards the negative
class, so that samples could not be classified correctly.

Therefore, we clean the dataset in Table 2 by filtering the white list, clearing data
already analyzed under the security policy, removing business characteristic data, and
analyzing the rest in combination with the external intelligence base. Table 3 shows the
features of the dataset after this preprocessing:


Table 3. Features of preprocessed data


Number of samples | Number of positive | Number of negative | Features
606,267           | 8,558              | 597,709            | 30


After the above preprocessing, the ratio of the positive class to the negative class
falls from 1:3322 in Table 2 to nearly 1:69 in Table 3. The following analysis is based
on the dataset in Table 3.
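
The class ratios quoted above can be checked directly from the counts in Table 2 and
Table 3. The short sketch below only reproduces that arithmetic; the helper name is an
illustrative assumption.

```python
def negatives_per_positive(n_pos, n_neg):
    """Number of negative samples for each positive sample."""
    return n_neg / n_pos

before = negatives_per_positive(8_558, 28_437_166)  # Table 2
after = negatives_per_positive(8_558, 597_709)      # Table 3

print(f"before preprocessing: 1:{before:.1f}")
print(f"after preprocessing:  1:{after:.1f}")
print("negative samples removed:", 28_437_166 - 597_709)
```

The quoted figures 1:3322 and 1:69 correspond to these quotients with the fractional
part dropped. All positive samples are retained, so only the negative class shrinks.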


3.3 Classifier Analysis 


In combination with the intrusion detection equipment, six weak classifiers are
extracted from the dataset. By manually analyzing and labeling the samples that each
weak classifier marked as positive, we obtain the actual performance and classification
ability of each weak classifier. Details are shown in Table 4.

Further analysis based on the data in Table 4:

Table 4. Features of weak classifiers


Identifier | Number of positive | Features | True positive | Precision
1          | 3,608              | 15       | 2,273         | 63.00%
2          | 1,550              | 11       | 783           | 50.51%
3          | 3,685              | 19       | 677           | 18.37%
4          | 1,904              | 17       | 329           | 17.27%
5          | 1,812              | 16       | 382           | 21.08%
6          | 849                | 9        | 176           | 20.73%

(1) In Table 4, the first two classifiers correspond to Internet traffic, and the last
    four classifiers mainly correspond to the intranet service and terminal detection
    equipment. Because their functions partially overlap, there is a large amount of
    duplicate data between different classifiers.

(2) After removing duplicate alarms, there are 4,420 external network alarms, of which
    2,629 are correctly classified; the precision is relatively high, reaching 59.48%.
    The precision of the intranet service and terminal detection equipment is relatively
    low: of 4,138 alarms, 791 are correctly classified, a precision of 19.11%.

(3) On the whole, among 8,558 alarms, 3,420 are correctly classified, a precision of
    39.96%. Among them, the external IP intrusion classifier has high precision and can
    identify most conventional intrusions; the most common intrusions include port
    scanning, Weblogic attacks, deserialization attacks and crawlers. The precision of
    intranet service management and terminal alarms is relatively low. Most of the false
    positives come from incorrect SQL injection identification, because different
    databases have different management strategies and release orders.

(4) Further analysis of the duplicate data shows that some attack features can only be
    classified correctly by specific devices, which makes it possible to schedule the
    dominant classifiers for detection and recognition through combination strategies.
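
The figures in this analysis can be reproduced from Table 4 and the de-duplicated totals
quoted above. The sketch below assumes that classifiers 1 and 2 form the external group
and classifiers 3 to 6 the intranet and terminal group, as item (1) states; the variable
names are illustrative.

```python
# (identifier, alarms raised, true positives), transcribed from Table 4.
table4 = [(1, 3608, 2273), (2, 1550, 783), (3, 3685, 677),
          (4, 1904, 329), (5, 1812, 382), (6, 849, 176)]

def pct(part, whole):
    """Percentage of part in whole."""
    return 100.0 * part / whole

per_classifier = {i: pct(tp, n) for i, n, tp in table4}

# De-duplicated totals quoted in the analysis.
external = pct(2629, 4420)
intranet = pct(791, 4138)
overall = pct(2629 + 791, 4420 + 4138)

# Surplus alarm reports caused by overlap between the classifiers of each group.
dup_external = sum(n for i, n, _ in table4 if i <= 2) - 4420
dup_intranet = sum(n for i, n, _ in table4 if i >= 3) - 4138

for i, value in per_classifier.items():
    print(f"classifier {i}: {value:.2f}%")
print(f"external {external:.2f}%  intranet {intranet:.2f}%  overall {overall:.2f}%")
print("duplicate reports:", dup_external, dup_intranet)
```

Some printed values may differ from the tables in the last digit, because the source
appears to truncate rather than round in places. The large surplus in the intranet
group is the overlap that item (1) refers to.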


3.4 Combination Strategy 


We randomly divided the sample data into two subsets: a training dataset and a testing
dataset. 70% of the samples are used as training data to determine the optimal model
parameters, and the remaining 30% are used as testing data to evaluate predictive
performance. We train models with three different combination strategies: relative
majority voting, weighted voting and stacking. Table 5 shows the classification results
under the different combination strategies.


Table 5. Classification result on different combination strategies


Combination strategy     | Precision | Recall | F-Score
Relative majority voting | 50.65%    | 91.15% | 35.14%
Weighted voting          | 48.98%    | 95.03% | 34.68%
Stacking                 | 47.91%    | 98.25% | 34.47%


The recall rate in Table 5 reflects the number of underreports of each combination
strategy. In the dataset of this paper, 1% of recall represents about 40 underreports.
From Table 5 we can see that relative majority voting and weighted voting mask part of
the feature recognition ability of the individual classifiers, which damages the model
when the number of weak classifiers is limited. Compared with these two methods,
stacking achieves a higher recall rate. In specific scenarios where the data features
are relatively fixed, stacking can find the correct classification when the weak
classifiers conflict with each other.


4 Conclusion 


(1) In this paper, the security monitoring dataset is generated by desensitizing data
    from a practical application and sampling over the full cycle. After undersampling,
    the sample data is relatively balanced. After data cleaning, the ratio of positive
    to negative samples is reduced from 1:3322 to 1:69 without damaging the features of
    the dataset.

(2) In practical security operations, different intrusion detection devices have
    different feature recognition capabilities. After preprocessing, the overall
    classification precision is 39.96%. External IP intrusions are easier to identify:
    the classification precision is 59.48%, and they can be correctly identified by most
    detection devices. For intranet service management and terminal security detection,
    the precision of the classifiers is low, only 19.11%. In practical applications,
    identifying these kinds of attacks depends more on dedicated equipment with the
    corresponding feature recognition ability.

(3) Compared with the voting methods, stacking performs better in combining weak
    classifiers. After stacking, the classification precision increases to 47.91%.
    Limited by the feature recognition ability of the detection equipment, the precision
    of the model is limited on this dataset. A new detection and recognition algorithm
    would need to be introduced to greatly improve it, and whether the classification
    precision can be further improved requires further research.


The detection described in this paper is mainly used for offline analysis. For real-time
monitoring and analysis, how to perform the analysis with a stream processing engine,
and what detection efficiency and effect can be achieved, still require further study
and improvement.


Abstract. Network security is an important guarantee for Mega-projects approved for
data clusters. It is necessary to comprehensively improve the network security
awareness, monitoring, early warning, disposal and evaluation capabilities of these
projects. This paper makes a comprehensive analysis of the network security issues in
Mega-projects approved for data clusters from the dimensions of computing facility
security, network facility security, combination and scheduling security, network
operation service security, data security, network situation awareness, etc. Gradually
evolving atomic security capabilities are set up to build a ubiquitous security network
computing brain. Data assets are identified in both active and passive ways and sorted
out through in-depth scanning and information completion; preset templates can be formed
according to AI (artificial intelligence) models, regular matching, keywords,
combination rules, etc.; data is classified and graded according to its sensitivity and
displayed visually in the form of charts. A multi-layer architecture is formed that
includes the collaborative scheduling of computing networks on the control side, the
perception of network convergence on the data side, and the management and scheduling
of computing resources on the service side. It realizes the interaction and supervision
of the whole process, all elements and the whole industry chain of computing scheduling,
provides security perception, monitoring, early warning, disposal and evaluation
functions, and improves the security perception and linkage monitoring capability
across data centers and clusters. Gradually, a coordinated threat handling capability
is built.


Keywords: Mega-projects approved for data clusters · Network security ·
Network situation awareness · Network security computing brain · Atomic
security capability · AI model


1 Introduction 


“Mega-projects approved for data clusters” refers to building a new computing power
network system that integrates data centers, cloud computing and big data [1], in order
to guide computing power demand from the east to the west in an orderly manner.
According to the demand for computing power, China promotes the echelon layout and
overall development of data centers from east to west, and accelerates the gradual and
rapid iteration of “Mega-projects approved for data clusters”. In order to
comprehensively boost the development of new data centers, build an intelligent
computing ecosystem with new data centers as the core, and give full play to the
enabling and driving role of the digital economy, the Ministry of Industry and
Information Technology has formulated and issued the Three-Year Action Plan for the
Development of New Data Centers (2021-2023) [2], and makes every effort to promote the
“Mega-projects approved for data clusters” project.

Network security is the premise for the development of “Mega-projects approved for
data clusters”. The project urgently needs to improve its network security perception,
monitoring, early warning, disposal and evaluation capabilities in an all-round way,
raise the security protection level of data resources over the whole life cycle,
improve the ability of computing power security monitoring and scientific scheduling,
and cope with network attacks through the transformation from static analysis to
dynamic perception, from post disposal to prior prevention, and from single-point
prevention and control to global joint prevention.

The project is oriented to scenarios of massive data processing and scientific
computing, as well as the training and inference scenarios of artificial intelligence
models under “East digital West training”. It promotes the successful implementation of
“Mega-projects approved for data clusters”, accelerates the transformation of data
centers, and provides new momentum for high-quality economic and social development.


2 General Analysis of Computing Power Security


For the construction of the security system of the “Mega-projects approved for data
clusters” project, it is necessary to refine the security assurance objectives, clarify
the access standards for security technical means such as security situation monitoring,
traffic protection and threat disposal, deepen policy reform measures and major
engineering proposals for data resource protection and for computing resource monitoring
and scheduling, and promote the application security of data resource circulation. As
the core task in the construction and application of “Mega-projects approved for data
clusters”, network security focuses on building a multi-level collaborative supervision
platform and monitoring system for basic networks, data centers, data center clusters,
cloud platforms and application enterprises, and on improving the ability of the project
to serve economic operation monitoring and industrial digital transformation monitoring.

The architecture of the computing power network consists of three layers: computing
power infrastructure, arrangement management and operation service. The infrastructure
layer consists of computing infrastructure and network infrastructure, which form a new
integrated computing network infrastructure and build a flexible and agile computing
base and a fully connected intelligent network at the cloud edge. The arrangement
management layer realizes the unified arrangement and intelligence of the computing
network by building the brain of the computing network. The operation service layer
creates a new operation service system and business model by using technologies such
as computing power trading, multidimensional dimension and computing power grid
connection. In this architecture, security runs through the whole process, and improving
endogenous security capability has become an important development goal. This paper
analyzes the relevant network security issues from the above dimensions.

The overall goal is to build a network security value system for “Mega-projects
approved for data clusters” and provide refined, ubiquitous and original twin security
services; to build a ubiquitous security computing network brain, provide a synchronous
“pay as you go” security experience, shift from application-based to task-based
security, and realize a differentiated security experience with a more refined process;
and to realize a near-source defense mode based on the twin computing power mode,
together with a super edge plus near-source side defense mode (Fig. 1).

[Figure panel text: Value 3: realize near source defense mode based on twin computing
power mode, and realize super edge + near source side defense mode]

Fig. 1. Ubiquitous security computing network brain


With the rapid development of computing network technology and its continuous
integration with Internet+, the industrial Internet, big data, cloud computing and other
new technologies, more and more information assets provide services with the help of
Internet technology. At the level of micro security capability implementation, gradually
evolving atomic security capabilities are built to protect network security in all
dimensions (Table 1).

Table 1. Atomic security capability table


Name of atomic security capability | Atomic security capacity of corresponding resource pool
Host asset discovery | Security asset management system -> host asset discovery
Software asset identification | Security asset management system -> software asset discovery
Web vulnerability scanning | Web vulnerability scanning
Host vulnerability scanning | Vulnerability scanning
Web page tamper proof | Web page tamper proof
Weak password scanning | Terminal detection and response -> weak password scanning
Webshell scan | Terminal detection and response -> webshell scanning
Baseline configuration check | Terminal detection and response -> safety baseline check
Terminal access control | Terminal detection and response -> host network access isolation
Configuration reinforcement (consolidated patch management) | Terminal detection and response -> configuration reinforcement
Document monitoring and protection | Terminal detection and response / Web page tamper proof
Terminal data leakage detection | Terminal detection and response -> terminal data leakage detection
Access behavior audit | Terminal detection and response -> access behavior audit
Terminal intrusion detection protection (combining terminal threat detection and host intrusion detection) | Terminal detection and response -> terminal intrusion detection / protection
Backup recovery | Terminal detection and response -> backup and recovery
Host forensics | Terminal detection and response -> host certificate
Terminal antivirus | Terminal detection and response -> antivirus
Network access control (combined network attack suppression) | Next generation firewall -> access control
Network address translation (NAT) | Firewall -> network address translation
Network isolation switching | Firewall -> network isolation switch
Network intrusion prevention | Next generation firewall
Denial of service protection | Anti denial of service system -> denial of service protection
Sensitive data leakage prevention | Intrusion protection system -> sensitive data protection
Spam protection | Mail security gateway -> spam protection
Network virus defense | Network virus defense
Network threat detection | Intrusion detection system
Network data leakage detection | Intrusion detection system -> sensitive data outgoing detection
Web application protection | Web application firewall
Code audit | Code audit system -> code audit
Database audit | Database audit
Log audit | Log audit
Network security audit | Network security audit
Sensitive data identification | Data security system -> sensitive data identification
Desensitization of sensitive data | Data security system -> sensitive data desensitization
Information service | Threat intelligence platform -> intelligence service
VPN access | VPN
Operation & Management access control | Fortress -> access control
Name of safe atomic capability | Honeypot -> network attack entrapment


3 Computing Facility Security


Computing infrastructure includes cloud computing, edge computing and end computing.
While providing powerful computing support services for upper-tier applications, it
also faces many risks. It is necessary to build comprehensive, systematic and
three-dimensional protection for cloud computing, edge computing and end computing.

In cloud computing, security protection should be provided for the physical layer,
virtualization, business, data, operation and maintenance management, etc. In edge
computing, security protection should be provided for network services, the hardware
environment, virtualization, the edge computing platform, applications, capacity
opening, management, data, etc.

In end computing, security protection should be provided for the physical layer,
virtualization, applications, capacity opening, management, data, etc.

At the same time, the interconnection of cloud, edge and end must also be protected,
by means including identity authentication, traffic monitoring and audit, interface
control and security situation monitoring.

The basic system planning of computing network security should be carried out
simultaneously. Based on the independent collaborative evolution stage of the computing
power network, the construction of basic atomic capabilities should be strengthened.
Standardization of computing security should be started, with standardized interfaces
and access criteria formulated to solve the problems of self security and
interoperability of the computing network. To meet the personalized and distributed
computing power needs of customers, technical pre-research and pilot demonstrations
should be conducted, and decentralization and security identification/security slicing
technologies should be adopted so that the security capability is compatible with the
distribution of computing power. The application of dynamic intelligent network slicing
technology should be researched to ensure differentiated network service capability.


4 Network Facility Security 


SRv6 (segment routing IPv6) simplifies the network protocol types, has good scalability
and programmability, can meet the diversified needs of more new services, provides
high reliability, and has good application prospects in cloud services. SRv6 and the
new generation SD-WAN (software defined wide area network) are the core technologies
for realizing the convergence of computing and networking. A networking scheme that
combines the two can realize network linkage between the backbone network and
enterprise sites, as well as the interconnection and perception of computing power.
Deterministic network technology provides quality of service guarantees for new
services with ultra-large bandwidth, ultra-low delay and ultra-high reliability.
However, the complex network environment, fuzzy security boundaries and high sensitivity
to delay have also brought new security challenges.

Traditional security solutions do not have the good scalability and programmability of 
SRv6 and the performance, flexibility or interconnection required for SD-WAN connec- 
tion. The atomic security capability can support flexibility, interconnection, scalability 
and programmability, sense the changes of edge connections, and provide consistent 
policy implementation. This policy can isolate users, applications, workflows, or data 
based on many parameters to provide security over the entire transaction path. Traffic 
can be forced to follow specific behaviors, or isolated to specific users or destinations to 
ensure consistent policy application and execution. 


5 Arranging and Scheduling Security 


Facing the highly complex computing network environment, the arrangement management
layer cooperatively schedules the resources of each domain of the computing network
according to diversified and customized computing power requirements. It perceives and
coordinates the arrangement of computing power users, computing tasks, network
resources and computing power resources. Arrangement management shall be able to control
the security of computing power and solve the problem of computing power abuse. The
abuse of computing power includes illegal mining, brute-force cracking and other acts,
which not only encroach on computing power resources but may also use computing power
to launch security attacks. Based on the self-adaptation mode of the computing power
network, the North-South linkage between security services and computing power should be
established, and the scheduling of security computing power promoted. Since
heterogeneous computing power nodes are introduced rather than being completely
self-built, it is necessary to solve the identity and trust problems of computing power
nodes, and to research and verify technologies such as differential privacy and
homomorphic encryption for the interaction between algorithms and computing power.
Pre-research on node collaboration should also be carried out: because the computing
power is distributed over multiple nodes, the nodes need a synchronization mechanism and
an adaptive, self-organizing architecture, and the “edge by edge collaboration”
mechanism is used for local interaction of capability and performance information.


6 Operation Service Security 


Operation service security mainly ensures the security of computing network services,
including identity security, operation security and integrated application security.
Identity security ensures that the identities of computing nodes and users in the
computing power network can be identified and verified. Operation security realizes
functions such as secure transactions, security monitoring and security audit.
Integrated application security provides flexible, dynamic and end-to-end business
security for differentiated application scenarios such as digital life, intelligent
production and digital society.


7 Data Security 


Data security [3] runs through all levels of the computing power network. It mainly
includes data asset identification, data security protection, data flow security,
computing security, East digital West training, etc., and ensures that data remains in
effective and legitimate use over its whole life cycle.


Data Asset Identification 

Data asset identification combines active and passive methods to discover assets
including servers, relational databases, non-relational databases, interfaces, etc.,
and completes data asset attributes through information completion and in-depth
scanning. From the perspective of data assets, data is obtained from SMC/SMP and from
data resource scanning and discovery, and is then classified and managed at different
levels. The classification and grading list management function mainly includes the
data classification and grading list, the important data list and the sensitive data
list. It supports real-time display of classification and grading information in
different dimensions, sorting of identified asset data, classification and grading
mapping of data according to data sensitivity, visual display in the form of charts,
and controllable storage of warm and cold data.

According to the data classification and grading rules of countries, industries or
enterprises, preset templates can be formed from AI models, regular matching, keywords,
combination rules, etc. Classification and grading templates can also be configured
according to the needs of the current business.
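
A minimal sketch of how such templates might combine regular matching and keywords is
given below. The template names, patterns, keywords and sensitivity levels are all
hypothetical examples, and AI-model-based templates are not covered. When several
templates match, the sketch keeps the most sensitive one, which is an assumed
combination rule.

```python
import re
from dataclasses import dataclass

@dataclass
class Template:
    category: str         # classification result
    level: int            # grading result, higher = more sensitive (assumed scale)
    pattern: str = ""     # regular expression applied to a sample value
    keywords: tuple = ()  # keywords looked up in the field name

    def matches(self, field_name, sample):
        if self.pattern and re.fullmatch(self.pattern, sample):
            return True
        return any(k in field_name.lower() for k in self.keywords)

# Hypothetical preset templates.
TEMPLATES = [
    Template("phone number", 3, pattern=r"1\d{10}"),
    Template("email address", 2, pattern=r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    Template("credential", 4, keywords=("password", "secret", "token")),
]

def classify(field_name, sample):
    """Return (category, level) for one field; the most sensitive matching template wins."""
    hits = [t for t in TEMPLATES if t.matches(field_name, sample)]
    if not hits:
        return ("unclassified", 1)
    best = max(hits, key=lambda t: t.level)
    return (best.category, best.level)

print(classify("contact", "13800138000"))
print(classify("user_password", "abc"))
print(classify("notes", "hello"))
```

A business-specific template can be added by appending another `Template` to the list,
which corresponds to configuring templates for the current business.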


East Digital West Training 

This is a data value evolution path with knowledge as the core. Driven by technology,
it develops artificial intelligence model training and inference, and constructs the
overall technical framework of “East digital West training”. AI has become the base of
new infrastructure technology, which accelerates the deployment of artificial
intelligence.

AI modeling is different from data development. It has no hierarchical modeling
restrictions, opens up data reading and warehousing, and supports free modeling. In
addition to the basic data processing components, it has rich built-in machine learning
algorithms, and it supports user-defined processing components to help dig deep into
data value.


Data Flow Security 

Data flow involves data aggregation, data transmission between providers and users, and
the use of data outside the control of its owners. Flowing data faces greater security
risks, including personal information disclosure, vulnerability to attack and leakage,
and illegal over-collection, analysis and abuse of data. During the data flow process,
the data shall be identified, and the data flow node, operation, flow direction and
other information shall be recorded. A unified cross-domain and cross-system data flow
identification shall be established so that the data flow direction can be controlled
and the data flow can be perceived. In order to monitor the flow of data in real time,
it is necessary to strengthen network security monitoring through technical means,
especially automated security monitoring, and to comprehensively monitor and analyze
the data sharing platform and system through traffic, logs, configuration files, etc.
This facilitates early warning and collaborative defense against network security
events, and improves overall security situation awareness, security decision-making and
other capabilities.
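
The recording requirement described above can be sketched as a small ledger that
stores, for each hop, the unified data identifier, the node, the operation and the flow
direction, and flags flows towards destinations that are not permitted. The class names,
field names and the allow-list check are illustrative assumptions, not a design taken
from the paper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class FlowRecord:
    flow_id: str      # unified cross-domain, cross-system identifier of the data
    node: str         # node at which the operation happened
    operation: str    # e.g. "transfer", "export"
    destination: str  # flow direction
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class FlowLedger:
    def __init__(self, allowed_destinations):
        self.allowed = set(allowed_destinations)
        self.records = []

    def record(self, flow_id, node, operation, destination):
        """Log one hop; return False as an early-warning signal if the destination is not allowed."""
        self.records.append(FlowRecord(flow_id, node, operation, destination))
        return destination in self.allowed

    def trace(self, flow_id):
        """Return the recorded path of one data item, in order."""
        return [(r.node, r.operation, r.destination)
                for r in self.records if r.flow_id == flow_id]

# Hypothetical node and destination names.
ledger = FlowLedger(allowed_destinations={"analytics-cluster"})
ok_first = ledger.record("d-001", "gateway-east", "transfer", "analytics-cluster")
ok_second = ledger.record("d-001", "analytics-cluster", "export", "unknown-host")
print(ok_first, ok_second)
print(ledger.trace("d-001"))
```

Keeping every hop, including the disallowed one, is a deliberate choice in this sketch:
the record is what makes the flow perceivable, while the return value supports control
and early warning.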


8 Situational Awareness 


Situational awareness integrates detection, early warning, response and disposal
functions, and is the security brain of the active defense system. It plans the security
capability of the integrated computing service system of “Mega-projects approved for
data clusters”, integrates the existing data center security data, interoperability
monitoring platform and supporting business systems, and builds computing security
perception and monitoring platforms at the data center level, the data center cluster
level and the industry-wide level. It realizes the interaction and supervision of the
whole process, all elements and the whole industry chain of computing scheduling, and
has the functions of security system [4] perception, monitoring, early warning,
disposal, evaluation, etc. It improves the security awareness and linkage monitoring
capability across data centers and across data center clusters, and gradually builds a
coordinated threat handling capability.


8.1 Situation Awareness of Network Security Quality Based on Data Network 
Collaboration 


It will improve the monitoring system for computing network governance, promote the 
optimization of the network architecture and traffic routing of data centers in the eastern 
and western regions, promote the quality monitoring of data network collaboration, 
promote the networking of edge data centers, and continuously improve the network 
capacity of data centers (Fig. 2). 
Fig. 2. Situation awareness of network security quality based on data network collaboration 


8.2 Situation Awareness of Computing Capacity Security Improvement 
Evaluation 


Use the network to drive the cloud and to strengthen computing: enhance the value of 
computing power on the basis of the computing power network, and safeguard that 
enhancement with computing power security. Promote the development of the computing 
power network in response to multiple demands, support the implementation of ubiquitous 
computing power with multiple technologies, and enhance the security value of computing 
power with multi-dimensional security situational awareness (Fig. 3). 
Fig. 3. Situation awareness of computing power security improvement evaluation 


8.3 Situation Awareness of Industry Chain Security Enhancement Assessment 


Accelerate key technology and product innovation at the software layer, such as new data 
center operation security management, and at the platform layer, such as cloud native and 
cloud-edge integration security, and improve software and hardware synergy. Establish and 
improve the new data center security standard system. Draw the security map of the whole 
industry chain of the new data center, promote the completion of key links, and carry out 
the security capability evaluation of the new data center (Fig. 4). 
Fig. 4. Industry chain security enhancement assessment situation awareness 


8.4 Situation Awareness of Green Low Carbon Assessment 


The continuous deepening of the national “double carbon” strategy has put forward 
higher requirements for the green and low-carbon level of the data center industry, and 
energy efficiency indicators such as PUE and CUE are more strictly restricted. “Mega- 
projects approved for data clusters” is a powerful driver for data centers to achieve 
“carbon peak and carbon neutrality”. The energy consumption indicators collected and 
evaluated after “Mega-projects approved for data clusters”, in particular the energy saved, 
can serve as one of the dimensions of its situation awareness. To achieve the “double 
carbon” goal quickly, implement the Notice of the Ministry of Industry and Information 
Technology on Printing and Distributing the Three-Year Action Plan for the Development 
of New Data Centers (2021-2023), and optimize the green development of the data center 
industry chain, it is necessary to establish and improve the green data center standard 
system (Fig. 5). 
Fig. 5. Green low carbon assessment situational awareness 
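
As a small worked illustration of the two energy efficiency indicators named in this section: PUE (power usage effectiveness) is commonly defined as total facility energy divided by IT equipment energy, and CUE (carbon usage effectiveness) as total CO2 emissions from data center energy divided by IT equipment energy. The function names and all figures below are hypothetical, and the energy-saved estimate assumes, for simplicity, that the IT load stays constant.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def cue(total_co2_kg: float, it_equipment_kwh: float) -> float:
    """Carbon usage effectiveness in kg CO2 per kWh of IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_co2_kg / it_equipment_kwh


def energy_saved_kwh(it_equipment_kwh: float,
                     pue_before: float, pue_after: float) -> float:
    """Facility energy saved when PUE improves at a constant IT load (assumption)."""
    return it_equipment_kwh * (pue_before - pue_after)


# Hypothetical figures for one reporting period.
example_pue = pue(total_facility_kwh=1500.0, it_equipment_kwh=1000.0)
example_cue = cue(total_co2_kg=600.0, it_equipment_kwh=1000.0)
example_saving = energy_saved_kwh(1000.0, pue_before=1.6, pue_after=1.3)
```

With these made-up inputs the PUE is 1.5, meaning that every kilowatt-hour delivered to IT equipment costs another half kilowatt-hour in cooling, power distribution and other overhead.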


8.5 Situation Awareness of Security Assurance Assessment 


The important supporting construction scheme of “Mega-projects approved for data 
clusters” clearly emphasizes that, from data risk identification and protection and data 
security compliance assessment to data encryption protection and related technical 
monitoring, it is necessary to “synchronously plan, construct and use” security technical 
measures to ensure business stability and data security (Fig. 6). 
Fig. 6. Security assurance assessment situation awareness 


8.6 Threat Collaborative Disposal Scenario 


Based on the new data center security monitoring means, and facing the network threat 
collaborative disposal scenario [5] of “Mega-projects approved for data clusters”, carry 
out closed-loop, collaboratively linked threat disposal; build up the network security risk 
case base, emergency drill scenario base, emergency disposal plan base, emergency 
disposal expert base and emergency response tool set base; promote the shift of threat 
disposal toward early risk warning and prevention; improve the scientific rigor, accuracy 
and timeliness of threat disposal; and strengthen capacity building for coordinated 
disposal. 


9 Conclusion 


The “Mega-projects approved for data clusters” project realizes “the network moves 
with the cloud, and the cloud moves with the needs”, forming a multi-layer architecture 
system including the computer network collaborative scheduling of the control plane, 
the network fusion perception and management of the data plane, and the arrangement 
of computing resources of the service plane. 

The new architecture, new technologies and new services of the “Mega-projects 
approved for data clusters” network may introduce new security risks, which must be 
addressed by a new security mechanism adapted to them. The infrastructure layer faces 
potentially complex network risks and computing power node security risks. The 
scheduling management layer involves scheduling security risks and loss of control over 
computing power use. The operation service layer faces problems such as access by 
malicious nodes, untrusted transactions and insecure applications. In addition, there may 
be data security risks such as uncontrollable data flow in the “Mega-projects approved for 
data clusters” network, which need to be addressed through an integrated, whole-process 
trust mechanism. 

This paper comprehensively analyzes the network security problems in “Mega-projects 
approved for data clusters” from the aspects of computing power facility security, network 
facility security, scheduling security, operation service security, data security, situation 
awareness and so on. It proposes to build a network-based ubiquitous endogenous security 
system with a ubiquitous security computing brain as the core, atomic security capability 
as the foothold, and intelligent orchestration as the link. 

Guided by security applications and oriented to the new business mode of cluster 
scheduling, and combined with the existing traffic protection and security monitoring 
means in the data center, the system focuses on realizing assessment systems for data 
network collaboration: network security quality situational awareness, computing power 
security improvement assessment situational awareness, industrial chain security 
enhancement assessment situational awareness, green low-carbon assessment situational 
awareness, security assurance assessment situational awareness, and others. 

It then promotes the construction of traffic protection security means for the network 
convergence, transmission, storage and integrated application links of the data center 
cluster nodes. In combination with active detection means and detection work, it 
implements the construction of security situation capability and builds the capability of 
collaborative threat disposal. 


Abstract. With the continuous growth of enterprises’ digital transformation, 
business-driven cloud computing has seen tremendous growth. 
The security community has proposed a large body of technical 
mechanisms, operational processes, and practical solutions to achieve cloud 
security. In addition, diverse jurisdictions impose regulatory requirements 
on data protection to mitigate possible risks, for instance, unauthorized 
access, data leakage, and disclosure of sensitive and private information. 
In view of this, several practical standards, frameworks, and best 
practices have been proposed in the industry to evaluate and improve the 
protection level of cloud data. However, few evaluation models can conduct 
a comprehensive quantitative evaluation of cloud data protection that 
includes security, privacy, and even ethical considerations. In this paper, 
we first make a comprehensive review of cloud data security and privacy 
issues, including ethical concerns, which we consider a specific type of 
risk caused by human factors; ethics here refers to acting honorably, 
honestly, justly, and legally, with due diligence and due care. 
Then, we propose a novel evaluation model for cloud data protection 
that can quantitatively assess the protection level. Finally, a parallel 
evaluation that compares manual assessment by experts with our evaluation 
model shows that our model is consistent with the conclusions of 
the manual evaluation. 


Keywords: Cloud data protection · Evaluation model · Security · 
Privacy · Ethics 


1 Introduction 


With the rapid improvement of cloud computing, the cloud offers flexible and 
affordable software, platforms, infrastructure, and storage to organizations 
across all industries. Faced with limited budgets and increasing growth 
demands, organizations see in cloud computing an opportunity to reduce 
costs, increase flexibility, and improve IT capability [14]. Despite the rapid 
adoption of cloud computing, security and privacy remain key issues for the 
security community [20,25]. Although cloud service providers like Amazon Web 
Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) continue to 
expand security services to protect their evolving cloud platforms, security 
and privacy remain lasting considerations when migrating traditional IT to 
the cloud [10]. 

Cyberspace is not peaceful: both external advanced persistent threats (APTs) 
and insider attacks occur from time to time. Since cloud environments often 
contain a variety of tenants and their vast amounts of valuable data, cloud 
platforms are also targeted by cyber threat actors [32-34]. 
In external APTs, attackers are always looking for new attack surfaces in the 
cloud to bypass existing security controls [41]. Insider attacks are often 
carried out by disgruntled employees, who have limited authorized access and 
tend to exfiltrate sensitive data or escalate privileges intentionally. Because 
it stems from the human factor rather than from a technical flaw, this is also 
an ethical issue. 

Cloud migration means that an enterprise loses physical control of its systems 
and data, and thus requires an assessment method to evaluate the protection 
level of cloud environments, including cloud data. Although many 
standards, frameworks, and best practices have been proposed by the security 
community and industry, there is rarely a comprehensive evaluation model that 
quantitatively scores the protection level of cloud data while fully 
considering security, privacy, and ethical issues. In summary, this paper makes 
the following contributions: 


— We are the first to make a comprehensive review of every aspect of cloud data 
protection based on our full knowledge of security, privacy, and ethics issues, 
which consists of technological mechanisms, operational policies, and legal & 
regulatory compliance. 

— We present a novel algorithm to compute the score of protection level based 
on our insight about important factors that affect cloud data protection, that 
is, intra-phase, inter-phase, lifecycle operations, and compliance. 

— We propose an empirical evaluation model to assess the overall protection 
level of cloud data based on the score of each factor that affects the protection 
level. 


2 Overview of Cloud Data Protection 


This section introduces the methodology related to cloud data protection. We 
leverage the Data States Model and Cloud Data Lifecycle Model to summarize 
major security and privacy controls from a top-level perspective. Further 
considerations on fine-grained controls, including technical measures, 
operational policies, and legal & regulatory compliance, will be discussed in 
the rest of the paper. 

Fig. 1. The data states model in IT systems. 


2.1 Data State Model 


Data, as a critical asset, exists in one of three states, both on-premises and in 
the cloud: at rest, in transit, and in use [26]. Regardless of the 
state of the data, IT systems should implement appropriate controls to protect the 
data and mitigate security and privacy risks [43]. Figure 1 shows the Data State 
Model, including the three states of data and the transformations between them. 
Data in use can be converted to both the in-transit state and the at-rest state; 
however, data at rest cannot be changed to the in-transit state directly, and vice 
versa. It is worth noting that this property of conversion between data states 
depends on the classical Von Neumann architecture, which is still the dominant one 
worldwide; other computing architectures, e.g. quantum computing, are out of our scope. 
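
The transition rule of the Data State Model can be written down as a small state machine. The sketch below is only an illustration: the names `DataState`, `ALLOWED`, `can_transition` and `route` are hypothetical, and it assumes that the conversions to and from the in-use state work in both directions.

```python
from enum import Enum


class DataState(Enum):
    IN_USE = "in use"
    AT_REST = "at rest"
    IN_TRANSIT = "in transit"


# Direct conversions: every conversion passes through the in-use state.
# Assumption: conversions to and from IN_USE are allowed in both directions.
ALLOWED = {
    DataState.IN_USE: {DataState.AT_REST, DataState.IN_TRANSIT},
    DataState.AT_REST: {DataState.IN_USE},
    DataState.IN_TRANSIT: {DataState.IN_USE},
}


def can_transition(src: DataState, dst: DataState) -> bool:
    """True if data can move from src to dst in a single step."""
    return dst in ALLOWED[src]


def route(src: DataState, dst: DataState) -> list:
    """Shortest sequence of states leading from src to dst."""
    if src is dst:
        return [src]
    if can_transition(src, dst):
        return [src, dst]
    # At rest <-> in transit is not direct: the data is first loaded into use.
    return [src, DataState.IN_USE, dst]
```

For example, sending a stored file over the network corresponds to `route(DataState.AT_REST, DataState.IN_TRANSIT)`, which passes through `DataState.IN_USE` because the file must first be read into memory.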


Data in Use refers to any data in the main memory or other caches while 
an application is using it. Due to the multitasking and concurrent features of 
modern information systems, it is important to ensure authorized access to data 
in use. Operating System (OS) built-in process isolation and application-level 
sandboxes are the primary controls for data in memory and cache. However, emerging 
attack vectors often try to bypass existing security mechanisms through 
vulnerability exploitation or advanced impersonation techniques. To this end, some 
research attempts to leverage homomorphic encryption [1], which limits the risk of 
data leakage because memory does not hold unencrypted data. 


Data at Rest, aka data on storage, is any data stored on media, such as 
hard drives, external USB drives, network attached storage (NAS), and storage 
area networks (SAN). The major risks it faces include data exfiltration, 
integrity breaches, and unavailability (e.g. Denial of Service, DoS). Strong 
symmetric encryption is the key control for the security and privacy of data 
at rest. In addition, as a compensating control, data redundancy can improve 
the availability of data. Furthermore, strict authentication and authorization 
controls [30] can also help prevent unauthorized access. 

Table 1. Cloud data lifecycle and representative controls 


Six phases of cloud data lifecycle 


CREATE STORE USE SHARE ARCHIVE DESTROY 
In Use x x x x x 
At Rest x x x x x 
In Transit | x x x x 
Controls by phase: 
CREATE: Data classification; IPSec/TLS VPN 
STORE: Encryption; Key mgt. 
USE: Virtualization; DRM/IRM; DLP; IPSec/TLS VPN 
SHARE: DRM/IRM; IPSec/TLS VPN 
ARCHIVE: Encryption; Key mgt.; IPSec/TLS VPN 
DESTROY: Crypto-shredding 


Data in Transit, also called data in motion, refers to any data transmitted 
over a network. The exchange of data between information infrastructures almost 
entirely depends on the transmission network in cyberspace. In particular, unlike 
the traditional on-premises model, which uses internal local networks, once 
enterprises migrate their IT systems to the cloud, all data access takes place over 
the Internet. Therefore, data in transit is more likely to be the target of cyber 
attacks than data in the other two states. A combination of symmetric 
and asymmetric encryption generally protects data in transit. 
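
TLS is one widely deployed protocol that combines the two kinds of cryptography: asymmetric techniques authenticate the server and establish a shared secret during the handshake, and symmetric encryption then protects the bulk data. As a minimal sketch, a client-side TLS configuration can be built with Python's standard `ssl` module; the helper name `make_client_context` and the choice of TLS 1.2 as the minimum version are assumptions made for this example.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate checks enabled."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumption for this sketch: refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_client_context()
```

A client would then wrap its socket with `context.wrap_socket(sock, server_hostname=host)` before exchanging any application data.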


2.2 Cloud Data Lifecycle 


Data in the cloud is constantly being created, stored, used, and transmitted, and 
once the data is no longer valuable, it needs to be destroyed. Unlike valuable 
physical assets, the value of data is time-sensitive and context-specific, which 
makes data protection complicated. The Cloud Data Lifecycle Model provides a 
generic approach to identifying the broad categories of risks facing the data and 
the associated security or privacy controls. This allows us to consider threats, 
vulnerabilities, and risks of cloud data at a higher level of abstraction without 
getting bogged down in the concrete details of a specific organization. 

Table 1 illustrates each phase of this model and the corresponding representative 
controls. Note that the cloud data lifecycle is not always iterative, nor is it 
strictly linear; data sometimes even exists in multiple phases 
simultaneously. As an example, data being shared may be used and stored at 
the same time if co-workers collaborate in a Software as a Service (SaaS) app. 
Furthermore, data in a phase can also exist in multiple states. Regardless, data 
should be protected at every stage with security and privacy controls 
commensurate with its value [18]. 
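
Because data can be in several lifecycle phases at once, the phases are naturally modeled as combinable flags. The sketch below is illustrative only: the names `Phase`, `CONTROLS` and `required_controls` are hypothetical, and the control lists are a simplified reading of Table 1 rather than an authoritative mapping.

```python
from enum import Flag, auto


class Phase(Flag):
    CREATE = auto()
    STORE = auto()
    USE = auto()
    SHARE = auto()
    ARCHIVE = auto()
    DESTROY = auto()


# Representative controls per phase (simplified reading of Table 1).
CONTROLS = {
    Phase.CREATE: ["data classification", "IPSec/TLS VPN"],
    Phase.STORE: ["encryption", "key management"],
    Phase.USE: ["virtualization protection", "DRM/IRM", "DLP", "IPSec/TLS VPN"],
    Phase.SHARE: ["DRM/IRM", "IPSec/TLS VPN"],
    Phase.ARCHIVE: ["encryption", "key management"],
    Phase.DESTROY: ["crypto-shredding"],
}


def required_controls(phases: Phase) -> list:
    """Union of the controls of every phase the data is currently in."""
    result = []
    for phase, controls in CONTROLS.items():
        if phase & phases:  # non-empty intersection means the phase is active
            for control in controls:
                if control not in result:
                    result.append(control)
    return result


# Co-workers collaborating in a SaaS app: shared, used and stored at once.
collaboration = Phase.SHARE | Phase.USE | Phase.STORE
```

Here `required_controls(collaboration)` merges the controls of the three active phases without duplicates, which mirrors the idea that data must be protected in every phase it currently occupies.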


— Create. Data can be created by a user at on-premises or legacy workstations 
and then transferred to the cloud, or created directly in the cloud. Data 
classification [13] is the most important security control at this stage. In 
addition, if the data is created remotely, transport security mechanisms 
such as IPSec/TLS VPN [22] are necessary. 

— Store. Typically, this phase is synchronized with the creation phase, and 
encryption is introduced to mitigate threats exposed in the data center of 
the cloud environment. Meanwhile, for any cryptosystem, key management is 
also a critical security control. 

— Use. Unlike data access on-premises, accessing cloud data remotely requires 
additional security and privacy controls, including: (1) implementing 
virtualization protection controls [44] to ensure that there is no unauthorized 
access between different guests hosted on the same server; (2) leveraging 
Digital Rights Management (DRM), aka Information Rights Management 
(IRM), solutions [40] to implement fine-grained dynamic access control; (3) 
conducting continuous monitoring via Data Loss Prevention (DLP) solutions 
[21], which are also used for data classification in the create phase; and (4) 
enabling secure network transmission mechanisms, such as the aforementioned 
IPSec/TLS VPN. 

— Share. This phase is relatively self-explanatory: access to data is granted by 
its owner to other users or entities that require it. If the data has been 
classified in detail, more accurate and fine-grained access control rules can be 
applied to data sharing. Otherwise, additional controls are required to protect 
the data. Many of the controls implemented in prior phases remain effective here, 
such as DRM/IRM solutions, IPSec/TLS VPN, and so forth. Furthermore, 
because cloud data centers are distributed, data can be located in data centers in 
different jurisdictions. Thus several restrictions may apply in accordance with 
regulatory mandates based on the location of the data center. 

— Archive. As an integral part of the cloud data lifecycle, archiving data 
is used for: (1) Business Continuity/Disaster Recovery (BC/DR) [4,29]; (2) 
data retention and audit [12]; (3) eDiscovery [28]; and (4) other compliance 
requirements. Similar to the store phase, the primary security control in 
this phase is encryption. In addition, due to the long timeframes for storage in 
archives, there may be additional concerns related to availability. 

— Destroy. Data that is no longer useful and no longer subject to retention 
requirements should be securely destroyed. Legacy or on-premises IT 
environments offer many options for data destruction, e.g. deletion, 
overwriting, degaussing, etc. In the cloud environment, however, data dispersion 
techniques leave only one practical choice, namely crypto-shredding, aka 
cryptographic erasure [38]. This mechanism encrypts the data with one strong 
encryption engine, then encrypts the key generated by that process with a 
second, different encryption engine, and finally destroys the key of the second 
encryption. 
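
A minimal sketch of the crypto-shredding idea, under one common reading of it: the data is encrypted under a data key, the data key is itself encrypted ("wrapped") under a second key, and destroying that second key leaves the stored ciphertext unrecoverable. The class `ShreddableStore` and the function `toy_cipher` are hypothetical names. The XOR keystream cipher is a toy stand-in for a real encryption engine and must not be used in practice, and setting a Python variable to `None` only illustrates key destruction; real systems destroy keys inside a key management system or hardware security module.

```python
import hashlib
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from key and nonce with SHA-256."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy XOR stream cipher (the same call encrypts and decrypts). NOT secure."""
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableStore:
    """Stores only ciphertext and wrapped data keys; shred() erases access."""

    def __init__(self) -> None:
        self._wrapping_key = secrets.token_bytes(32)  # key of the second round
        self._records = {}  # name -> (nonce, wrapped data key, ciphertext)

    def put(self, name: str, plaintext: bytes) -> None:
        data_key = secrets.token_bytes(32)  # fresh key of the first round
        nonce = secrets.token_bytes(16)
        ciphertext = toy_cipher(data_key, nonce, plaintext)
        wrapped_key = toy_cipher(self._wrapping_key, nonce, data_key)
        self._records[name] = (nonce, wrapped_key, ciphertext)

    def get(self, name: str) -> bytes:
        if self._wrapping_key is None:
            raise KeyError("wrapping key destroyed; data is unrecoverable")
        nonce, wrapped_key, ciphertext = self._records[name]
        data_key = toy_cipher(self._wrapping_key, nonce, wrapped_key)
        return toy_cipher(data_key, nonce, ciphertext)

    def shred(self) -> None:
        """Destroy the wrapping key; the stored ciphertext stays in place."""
        self._wrapping_key = None
```

The point of the design is that destruction does not need to locate and overwrite every dispersed copy of the data: once the wrapping key is gone, the copies that remain are only ciphertext.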


2.3 Shared Responsibility Model 


Cloud computing is a business-driven computing model rather than a 
technology-driven one, thus the interests of cloud service providers (CSPs) and 
cloud service customers (CSCs) are not always aligned. CSCs want maximum computing 
capabilities at the lowest cost, whereas CSPs want to provide as few services 
as possible while maximizing profits. We do not review the cloud 
computing reference model here, as it is clearly defined in ISO/IEC 17789 
[24]. Fortunately, despite the adversarial relationship between the two 
sides, their interests in security and privacy converge. For example, 
a data breach of a CSC caused by vulnerabilities in the infrastructure 
of a CSP will cause both parties to suffer brand and reputation damage, lower 
profits, and even ongoing lawsuits. 

Table 2. The shared responsibility model (C: CSC, S: both, P: CSP) 

Service models (columns): IaaS | PaaS | SaaS 
Layers (rows): data access security; application security; platform security; 
virtualized infrastructure security; physical infrastructure security 
Data access security: C | C | C 

The Cloud Shared Responsibility Model [16] clarifies the responsibilities 
of both CSPs and CSCs for the defense-in-depth of the cloud architecture. Table 2 
shows the details of this model, where the rows and columns represent the layers 
of the cloud architecture and the cloud service models, respectively. In Table 2, 
cells marked C, S, and P indicate the responsibility of the CSC, both parties, and 
the CSP, respectively. Although we list only the three most common 
cloud service models here, namely Infrastructure as a Service (IaaS), Platform 
as a Service (PaaS), and Software as a Service (SaaS), which are also defined 
in ISO/IEC 17789 [24], other service models have a similar shared responsibility 
model. We note in particular that, regardless of the service model, the 
responsibility for data access security belongs to the CSC. This means that 
the ultimate responsibility for any data breach is borne by the CSC, 
although the CSC also has the right to seek compensation from the CSP. Due 
to the elementary principle of “layered defenses” in information security, both 
CSCs and CSPs need to implement security and privacy controls at different 
layers to protect data. In a nutshell, this paper does not distinguish 
which party is responsible for a given security control, since this does not 
influence our evaluation of the security, privacy, and ethics of controls. 


3 Techniques, Operations, and Compliance 


Industry practice over the past decade shows that a large body of proven 
approaches, mechanisms, and tools from traditional IT has been adopted 
to build the foundation of cloud computing. Therefore, the 
cloud and traditional IT share most of the security controls that secure 
systems and data. These controls usually cover three aspects, namely techniques, 
operational activities, and compliance. We will discuss security and privacy 
controls for cloud data protection, along with the possible risks and how to 
mitigate them. Furthermore, we discuss ethical considerations, which can in 
essence also be considered a risk and which refer to acting honorably, honestly, 
justly, and legally, with due diligence and due care. 

The rest of this section discusses cloud-specific key security and privacy 
controls in three aspects, i.e. technical mechanisms, operational policies, 
and legal compliance. 


3.1 Technological Mechanisms 


Unlike on-premises environments, the cloud environment can rarely protect data by 
implementing strong access controls at clear boundaries, thus encryption is the 
primary option for protecting data. Cloud computing is usually multi-tenant, 
and even under the private cloud deployment model there are 
conflicts of interest among different departments within the same organization. 
Therefore, data obfuscation is an important security and privacy control. 
In addition, virtualization techniques and the corresponding security controls, 
as elementary cloud infrastructure, face several cloud-specific risks. 


1) Encryption and Key Management 

It should come as no surprise that cloud computing has a deep dependency 
on encryption: no matter what state the data is in, it is impossible to use 
cloud computing in any secure way without encryption technology. 
Given the criticality of encryption, organizations should concentrate their 
efforts on correctly implementing and deploying cryptographic systems, and 
key management is the area of greatest concern. If an organization uses multiple 
CSPs or intends to keep physical control over cryptographic keys, one solution is 
to escrow keys within the organization, but this requires additional infrastructure 
and personnel. Another way is to escrow keys with a third party, such as the 
prevailing Cloud Access Security Broker (CASB) [3], which is a service that 
provides key management and unified cloud data access control. 

Despite all the efforts to encrypt data in the cloud, some risks remain and 
force us to strike a balance. First, encryption can be done at different 
layers and granularities, such as volume-level, object-level, file-level, 
application-level, and so forth [42]. For performance reasons, it is difficult 
to implement strong encryption at all layers. For example, a volume that is 
encrypted at the volume level and attached to a virtual machine (VM) 
instance is still vulnerable if an attacker gains access to that VM instance. 
Second, IT administrators or security staff may need to access other 
personnel’s cryptographic keys for key recovery or other reasons. If a 
disgruntled employee obtains such a key, the risk of unauthorized access 
increases. This is also an ethical issue. Third, although it is not good practice, 
CSCs sometimes escrow keys, for technical or budgetary reasons, with the same CSP 
that stores the organization’s data. This risk of dependency is also termed 
Lock-In, which will be discussed in Sect. 3.2. Finally, because legal and 
regulatory requirements prescribe specific encryption algorithms or methods, 
there is a security gap between different jurisdictions when data is exchanged 
across borders. This issue will be detailed in Sect. 3.3. 

Fig. 2. Overview of tokenization mechanism. 


2) Data Obfuscation and De-identification 

For security and privacy, practical cloud data protection often needs to 
obscure sensitive data or to use a representation of that data instead. Masking 
is an elementary data hiding technique (e.g., showing only the last four digits 
of a credit card number). Similar techniques include randomization, which 
replaces part of the data with random characters, and shuffling, which substitutes 
values taken from different records within the same dataset. Tokenization is 
another privacy protection technique, illustrated in Fig. 2: a nonsensitive tag 
called a token is created as a substitute to be used in place of sensitive data. The 
implementation of tokenization typically consists of two databases, one storing 
the actual sensitive data, and the other storing the tokens corresponding to each 
data entry. A user who needs to access data first obtains nonsensitive tokens, and 
then a strong access control mechanism such as Identity and Access Management 
(IAM) [15] decides whether this user can access the corresponding sensitive data 
entries. Anonymization is the primary technique for de-identification when the 
data contains Personally Identifiable Information (PII). This process includes 
removing direct identifiers, e.g. names and bank accounts, and indirect identifiers, 
which are often statistical or demographic information but can be combined to 
infer PII, e.g. personal age and shopping history [6,8,31,37]. 
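
Masking and tokenization as described here can be sketched in a few lines. The example is illustrative only: `mask_card` and `TokenVault` are hypothetical names, the two dictionaries stand in for the two databases, and a fixed set of authorized user names stands in for a real access control mechanism such as IAM.

```python
import secrets


def mask_card(number: str, visible: int = 4) -> str:
    """Mask a card number, keeping only the last `visible` digits."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return "*" * max(len(digits) - visible, 0) + digits[-visible:]


class TokenVault:
    """Toy tokenization service backed by two in-memory 'databases'."""

    def __init__(self, authorized_users) -> None:
        self._authorized = set(authorized_users)  # stand-in for IAM decisions
        self._sensitive_db = {}  # record id -> actual sensitive value
        self._token_db = {}      # token -> record id

    def tokenize(self, value: str) -> str:
        record_id = len(self._sensitive_db)
        self._sensitive_db[record_id] = value
        token = "tok_" + secrets.token_hex(8)  # random, reveals nothing
        self._token_db[token] = record_id
        return token

    def detokenize(self, token: str, user: str) -> str:
        # Access control decides whether the user may see the real value.
        if user not in self._authorized:
            raise PermissionError(f"{user} may not read sensitive data")
        return self._sensitive_db[self._token_db[token]]


vault = TokenVault(authorized_users={"alice"})
card_token = vault.tokenize("4111 1111 1111 1111")
```

Applications can pass `card_token` around freely; only a call such as `vault.detokenize(card_token, "alice")` by an authorized user returns the real value, while `mask_card` offers a display-friendly alternative when the last digits are enough.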

There are three major risks during the implementation of the aforementioned 
security and privacy controls. First, the above techniques perform well for 
structured data but may present problems on unstructured data, which could be 
located in any media. Although DLP and continuous monitoring are available 
as solutions, they are still not enough to address the challenge of sensitive 
data mining. Second, the tokenization technique depends on the access control 
mechanism and thus inherits all of its risks, among which the human factor is 
always the most significant. This is also an ethical dilemma. Last, although 
most privacy regulations require data anonymization or de-identification for any 
PII use outside of live production environments, identifying indirect 
identifiers is a hard nut to crack, because there is a lack of effective rules to 
estimate whether a seemingly innocuous piece of information is an indirect 
identifier that can be combined with other information to infer PII. 
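
One simple way to see why indirect identifiers are hard to judge in isolation is to count how many records share each combination of attribute values. The sketch below borrows the group-size idea behind k-anonymity, which is not a method proposed in this paper; the function name `smallest_group` and the records are hypothetical.

```python
from collections import Counter


def smallest_group(records, columns) -> int:
    """Size of the smallest group of records sharing the same values in `columns`."""
    counts = Counter(tuple(record[c] for c in columns) for record in records)
    return min(counts.values())


# Hypothetical records with direct identifiers already removed.
records = [
    {"age": 34, "last_purchase": "stroller"},
    {"age": 34, "last_purchase": "laptop"},
    {"age": 51, "last_purchase": "stroller"},
    {"age": 51, "last_purchase": "laptop"},
]

by_age = smallest_group(records, ["age"])
by_purchase = smallest_group(records, ["last_purchase"])
combined = smallest_group(records, ["age", "last_purchase"])
```

In this made-up dataset, each attribute on its own is shared by two people, but the combination of the two attributes singles out every individual, which is exactly the kind of inference that makes indirect identifiers difficult to enumerate by rule.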


3) Virtualization 

As is well known, virtualization technology is the cornerstone of cloud computing, 
which helps the cloud to implement its widely acclaimed on-demand services and 
resource pooling. In a sense, virtualization is also a security control that achieves 
access control through the isolation of diverse layers. Despite all the convenience, 
several risks need to be considered when protecting cloud data in practice using 
virtualization. First, since the hypervisor that manages VM instances is the critical 
component of the virtualization solution, it is an attractive target. Compromising 
a VM instance only results in a data breach within that VM guest, thus 
threat actors may instead attempt to compromise the hypervisor. Because the 
hypervisor acts as the interface and controller between the virtualized instances 
and the host resources, exploiting the hypervisor can affect the security of all 
VM guests [7]. Another risk is guest escape. Weakly designed or configured VM 
instances or hypervisors may allow users to break restrictions and leave their 
own VM instances to gain unauthorized access. There are two kinds of guest 
escape: one is lateral movement, that is, unauthorized access from one VM guest 
to another, and the other is vertical movement, in which a user in one VM guest 
obtains permissions on the host machine. The second kind is more harmful to 
cloud data protection. Finally, since the cloud environment 
is multi-tenant, we have to deal with data seizure issues. Legal activity 
may result in law enforcement agencies or plaintiff attorneys seizing or 
inspecting a host machine that runs hundreds of VM instances belonging to 
different CSCs, even if a given organization is not the target. Great efforts still 
need to be made to cope with this problem by both the security community and 
the judicial community [39]. 


3.2 Operational Policies 


While technical controls have laid the foundation for mitigating cloud risks, 
security operations in the cloud provide ongoing security and privacy assurance. 
This section discusses several key controls in security operations. For reasons
of space, we do not cover every detail and focus instead on the analysis of
security operations policies.


1) Data Classification 

Data identification and classification are the foundation of cloud data protection.
All implemented technical and administrative controls determine the level of
protection based on the classification of data. Since the organization is constantly
creating data in its operations, classification is an ongoing operational process,
also known as “Data Discovery” [17]. Typical approaches to data discovery include
label-based, metadata-based, and content-based ones. Whether the data is created
on-premises or in the cloud, a useful aid to data classification is DLP, a technology
system designed to identify, inventory, and control the use of data that an
organization deems sensitive, even when that data is employees’ personal data, such
as web browsing history, pending resignation letters, and so forth. In a nutshell,
data discovery can sometimes be a “double-edged sword”, raising privacy and ethical
concerns.


2) DRM/IRM 

Data is out of the physical hold of the organization in the cloud, thus com- 
pensating controls are needed to protect the data during its lifecycle, especially 
during the use and share phases. DRM aka IRM which is mentioned in Sect. 2.2 
is an ideal mechanism [23,27]. DRM/IRM usually has the following advantages: 
(1) persistent protection, which follows the information it protects, regardless 
of where it is located; (2) dynamic policy control, which allows data owners to 
modify access control lists (ACLs) and permissions for the protected data under 
their control; (3) remote rights revocation, whereby the data owner can revoke
permissions at any time; (4) continuous auditing, which allows for comprehensive
monitoring of the access history.

Despite the many advantages of DRM/IRM, leveraging DRM/IRM in the 
cloud still faces some challenges. One is replication restrictions. DRM/IRM
involves permissions for replication and sharing, but the administrative process 
in the cloud environment often requires creating, shutting down, moving, and 
backing up VM instances, which is undoubtedly in conflict with the policies of 
DRM/IRM. The other is jurisdictional conflicts. The blurred physical boundaries
brought about by cloud computing lead to the transborder flow, or even loss of
control, of a large amount of data, which in turn triggers regulatory restrictions
in different jurisdictions.


3) Continuous Monitoring 

Automated or continuous monitoring and reporting is an important mechanism 
for cloud computing to achieve its capability of self-service. The monitoring 
objects mainly include: (1) physical environment, involving the temperature, 
moderation, and so forth of the data center; (2) host-level, including the perfor- 
mance and event tracing of the operating system, middleware, and applications; 
(3) network-level, refers to monitoring various network components, not only 
hardware and software but also cabling, Software Defined Network (SDN), and 
control plane. 

Continuous monitoring can improve performance and enhance security; however,
it can also raise privacy and ethical issues. As an example, the CSP collects
the event tracking log of the operating system through the agent installed in the 
VM instance, so as to obtain the VM guest status and implement anomaly 
detection of the system. Although the system event log does not contain direct 
identifiers i.e. PII, the user’s behavioral characteristics can still be analyzed 
by reasoning about system events. Thus those auditing data can be used for 
precision marketing, and in the worst-case obtained by cyber threat actors to 
understand user behavior so that they can prepare proper attack vectors. 


3.3 Legal and Regulatory Compliance - LRC


Since the essence of cloud computing is to drive business improvement, many 
of its features such as decentralization and multi-tenancy make it difficult to 
comply with existing data privacy protection and other laws and regulations. 


1) eDiscovery 

eDiscovery refers to the process of identifying and obtaining electronic evidence 
for either prosecutorial or litigation purposes. Since cloud computing is often 
multi-tenant, it is more difficult to find data owned by one CSC without invading 
data from other CSCs that may reside on the same storage volume, drive, or 
physical machine. In addition, from a judicial point of view, all evidence needs to 
be tracked and monitored from the time it is recognized as evidence and acquired 
for that purpose, which is also called chain of custody. However, cloud computing
by design may dynamically reallocate and recycle resources in the same storage
location for other tenants, which conflicts with this judicial principle. Thus
when creating security and privacy policies for maintaining a chain of custody 
or conducting activities requiring the preservation and monitoring of evidence, 
we need to comply with the regulations. 


2) Diverse Jurisdictions 

Many of the difficulties in complying with laws and regulations in cloud computing
stem from its design. Cloud resources are often dispersed across county, state,
and even international borders. As mentioned earlier, the transborder transfer of
data is the greatest obstacle to the
cloud's compliance with laws and regulations. The governance of compliance must
take all of the applied laws and regulations into account to operate reasonably 
with an understanding of legal risks and liabilities in the cloud. 


4 Empirical Evaluation Model 


This section will detail our proposed evaluation model for cloud data protection. 
By analyzing the aforementioned factors that affect cloud data security and 
privacy, we present a novel algorithm to quantitatively calculate the score of 
cloud data protection in a specific organization, and show its overall protection 
level. 


4.1 Important Factors 


We consider four factors to be important when assessing the protection level of 
cloud data in an organization. 


— Intra-phase. As mentioned in Sect. 2.2, a variety of security and privacy 
controls are implemented at each phase of the cloud data lifecycle. In Table 1 
we can see, even within one phase, there may be multiple data states, which 
means that more controls need to be implemented so that data in different 
states are protected. To this end, the more states of the data within a phase, 
the more assessment is required. Furthermore, data in transit is generally
more vulnerable to compromise than data in storage and at rest, because data
in transit on public carriers is beyond the confines of the cloud itself. Based
on this intuition, we prefer to assign a higher weight to phases with multiple
states or containing the transit state in our evaluation model.

— Inter-phase. Over time, the more phases cloud data passes through in its
lifecycle, the more opportunities it has to be processed, used, and shared. In 
other words, data in later phases are more critical than in former ones, thus 
more controls need to be implemented in later phases. As an extreme example, 
if the data is not properly destroyed during the destruction phase, such as 
using an insecure method (e.g., simply using the OS delete or rm command), 
or if the encryption key is leaked despite leveraging the recommended method 
of crypto shredding, the organization will eventually lose control of data
that it believes to be destroyed but that can still be accessed. In short,
the later phases deserve more attention in the evaluation.

— Lifecycle operations. Cloud security operations are the guarantee to manage
and mitigate cloud data risks within an acceptable range. Although many
aspects of security operations are handled by CSPs, that is to say, it is invisi- 
ble and imperceptible to the CSCs, we should understand that cloud security 
operations include many global security controls during the whole cloud data 
lifecycle, such as BC/DR and continuous monitoring mentioned in Sect. 3.2. 
Our insight is that the cloud security operations should be considered as 
security and privacy measures applied to each phase of the cloud data life- 
cycle, and therefore should be given global priority and proper weight in the 
evaluation model. 

— Compliance. Compliance requirements are an important aspect of Information
Security Management System (ISMS) [2,5], and there is no doubt that
the use of cloud computing presents many challenges in identifying and com- 
plying with compliance in specific jurisdictions. In typical cloud scenarios, 
since resources are allocated dynamically, CSCs do not know the exact phys- 
ical location of the data. In fact, even CSPs may not know which location 
the data is in at all times when they manage the VM images and other data, 
depending on the level of automation and data center design. Similar to secu- 
rity operations, compliance requirements should also be fully evaluated in the 
assessment model as a global factor. 


Hence, to conduct a comprehensive evaluation of cloud data protection, we
need to first calculate the scores for intra-phase, inter-phase, operations and
compliance, respectively.


4.2 Quantitative Analysis 


Table 3. Intra-phase weight rating scale

Phases  | Data states                     | Weight
        | In transit | At rest | In use   |
CREATE  | x          | x       | x        | 7 (111)
STORE   |            | x       |          | 2 (010)
USE     | x          | x       | x        | 7 (111)
SHARE   | x          |         | x        | 5 (101)
ARCHIVE | x          | x       | x        | 7 (111)
DESTROY |            | x       | x        | 3 (011)


Table 4. Severity levels of possible risks rating scale

Severity level | Quantitative range | Rounded up average value
Low            | 0.1-3.9            | 2.0
Medium         | 4.0-6.9            | 6.0
High           | 7.0-8.9            | 8.0
Critical       | 9.0-10.0           | 10.0


First, to calculate the score of intra-phase, we define the weights based on the
different data states within a phase, as shown in Table 3. We prioritize the three
data states of in transit, at rest, and in use according to the likelihood of risk
discussed in Sect. 4.1, and construct a binary truth table. In the last
column of Table 3, we can see the weight values generated in different phases due
to the existence of different data states, with the binary representation given in
parentheses. We further use the weight of intra-phase to compute its protection
score, which is defined as:


IntraS = \sum_{i=1}^{6} \left( w_i \times \frac{1}{\max(r_i)} \right) \qquad (1)


where w_i is the weight of the i-th phase of the cloud data lifecycle defined in Table 3,
and r_i denotes the severity levels of possible risks in the i-th phase, which can be
found in industry standards or best practices. Since the possible risk within the 
phase with diverse data states is usually technical issues, we identify the risks 
and assign their severity levels based on the Common Attack Pattern Enumer- 
ation and Classification (CAPEC) list defined by US-CERT and DHS with the 
collaboration of MITRE [35]. For quantitative calculation, we map the sever- 
ity level to a numerical value based on the conversion table shown in Table 4 
included in the Common Vulnerability Scoring System (CVSS) [19]. Thus, we 
select the highest value of identified risk and then obtain the intra-phase score. 
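
As a quick check of the intra-phase score, the following minimal Python sketch reads Eq. (1) as a plain sum over the six phases of w_i / max(r_i); that reading is an assumption. The weights come from Table 3 and the highest severity values from Table 5. The function and variable names are illustrative only.

```python
# Weights from Table 3 (the binary form encodes in transit / at rest / in use).
PHASE_WEIGHTS = {
    "CREATE": 0b111, "STORE": 0b010, "USE": 0b111,
    "SHARE": 0b101, "ARCHIVE": 0b111, "DESTROY": 0b011,
}

def intra_phase_score(weights, severities):
    """Sum w_i * 1 / max(r_i) over all phases.

    severities maps each phase to the list of severity values of its
    identified risks; only the highest value per phase is used.
    """
    return sum(w / max(severities[phase]) for phase, w in weights.items())

# Highest severity value per phase, taken from the IntraS rows of Table 5.
case_study = {
    "CREATE": [8.0], "STORE": [6.0], "USE": [2.0],
    "SHARE": [10.0], "ARCHIVE": [6.0], "DESTROY": [6.0],
}

score = intra_phase_score(PHASE_WEIGHTS, case_study)
print(f"{score:.3f}")
```

Under this reading the sum is 6.875, which agrees with the intra-phase score of 6.88 reported in Table 5.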

Next, we define the formula for calculating the score of protection level of
inter-phase.

InterS = \sum_{i=1}^{6} \left( w'_i \times \frac{1}{\max(r_i)} \right) \qquad (2)

where w'_i = (10+i)/10 for the phase of cloud data lifecycle i, which takes into
account slightly higher weights for later phases mentioned in Sect. 4.1. Similar
to the formula of intra-phase score, r_i also means the possible risks during the
whole lifecycle of cloud data. We use CAPEC and Cloud Controls Matrix (CCM) 
published by Cloud Security Alliance (CSA) [9] to identify more possible risks. 
Note that our evaluation model supports customization, whereby the risk
identification model and severity levels can be replaced with others specified by
experts. Thus, the risk with the highest severity value is identified
to represent the most critical risk in each phase, and the sum over all phases
gives the inter-phase score.
Then, to calculate lifecycle operations score, we define the formula as: 


OpS = \frac{1}{\max(r)} \qquad (3)

where r is the possible operations risk, e.g., data breaches. Similarly, the risk of 
the highest severity value is the representation of the protection level of opera- 
tions. 

Similar to OpS, the score of compliance is defined as:


ComS = \frac{1}{\max(r)} \qquad (4)


where r is the possible compliance risk based on the organization’s geographic 
location, its jurisdiction and industry. As an example, a bank located in EU 
needs to comply with GDPR [36] and PCI DSS [11]. 

Last, we give the overall formula to compute the protection level score of 
cloud data in an organization, which is defined as: 


S = \alpha \times IntraS + \beta \times InterS + \gamma \times OpS + \delta \times ComS \qquad (5)


where the parameters α, β, γ, and δ can be configured based on expert
knowledge and specific application scenarios. Based on our empirical experience, the
default values of those parameters are 0.2, 0.2, 0.3, and 0.3 respectively. 


4.3 Case Study 


We apply the protection score presented above to assess a financial enterprise
that prefers to remain anonymous, and the result is in accordance with a parallel
manual evaluation by experts. Table 5 shows the four sub-scores, which reflect the
four evaluation factors mentioned before. For the intra-phase score, we obtained the
highest severity value in the SHARE stage, because the target of evaluation
has weak access control for user PII. For the inter-phase score, we
identified the highest severity value in the transition from the ARCHIVE stage to the
DESTROY stage, due to the lack of an approved and unified mechanism for destroying
data that no longer needs to be retained. Using the default parameters mentioned
in Sect. 4.2, the overall protection score of the target of evaluation’s cloud data
is 0.46. We mapped this score to the numeric intervals
listed in Table 4, and it shows that the overall protection level is medium. This
is consistent with the manual qualitative assessment by another team.


Table 5. Case study scores.


Factor        | Phases  | Severity value of highest risk
IntraS        | CREATE  | 8.00
              | STORE   | 6.00
              | USE     | 2.00
              | SHARE   | 10.00
              | ARCHIVE | 6.00
              | DESTROY | 6.00
              | Score   | 6.88
InterS        | CREATE  | 6.00
              | STORE   | 2.00
              | USE     | 6.00
              | SHARE   | 2.00
              | ARCHIVE | 8.00
              | DESTROY | 2.00
              | Score   | 1.42
OpS           | -       | 2.00
ComS          | -       | 8.00
Overall Score | -       | 0.46


5 Related Work 


5.1 Cloud Security Assessment 


Traditional information system risk assessment mechanisms are still effective for
cloud computing environments. However, as a popular computing architecture,
the cloud computing environment has some aspects that distinguish its risk
assessment from that of other IT systems. First, the cloud environment involves more
entities, including CSPs, CSCs, cloud users, cloud auditors, cloud carriers, and so
forth. These stakeholders bring more challenges to cloud security assessment. 
Second, the technology stack of cloud computing architecture is more complex, 
and the evaluation targets include components owned and used by multiple par- 
ties such as physical environment, virtualization, and applications. In addition, 
compliance with the cloud environment is also an important aspect of cloud risk 
assessment [9]. 


5.2 Data Security and Privacy 


Data security, privacy, and ethics have come to be widely considered in the security
community and among the legal profession. Whether data incorporates security
and privacy controls throughout its life cycle is a critical consideration for data
security assessment. Data classification and the access control matrix are important
data risk assessment tools as well as data security controls. Moreover, continuous
monitoring is also used to assess whether the exchange of data violates the
organization’s data security policy [12-14, 26, 43].


6 Conclusion 


In this paper, we make a comprehensive review of each aspect of cloud data 
protection including security, privacy, and ethical considerations. To evaluate an 
organization’s cloud data protection level, we propose an empirical model that 
calculates the protection score based on the four important factors we consider. The
novel algorithm we present can improve the ability of automated evaluation and
the credibility of evaluation results. However, our evaluation
model is still semi-automatic and needs experts to identify risks
and conduct other manual activities. Reducing this manual effort is the goal of our future research.


Abstract. The network intrusion detection system (NIDS) plays an
essential role in network security. Although many data-driven approaches
from the field of machine learning have been proposed to increase the
efficacy of NIDSs, they still suffer from extreme data imbalance, and the
performance of existing algorithms depends highly on training datasets.
To counter the class-imbalance problem in network intrusion detection,
it is necessary for models to capture more representative clues
within the same categories instead of learning from classification loss only.
In this paper, we propose a self-supervised adversarial learning approach
for intrusion detection, which utilizes instance-level discrimination
for better representation learning and employs adversarial-perturbation-styled
data augmentation to improve the robustness of NIDS on rarely
seen attack types. State-of-the-art results were achieved on multiple
frequently used datasets, and experiments conducted in a cross-dataset setting
demonstrated good generalization ability.


Keywords: Network intrusion detection · Self-supervised learning ·
Adversarial learning


1 Introduction 


While the advent of the Internet has brought immense convenience to our daily
lives in recent decades, it has also unavoidably introduced many new challenges.
As people nowadays spend more time in cyberspace than in the real world, whether
living or working, attacks on network activities that use various kinds of
intrusion techniques to steal private or corporate confidential information
have never stopped. As a countermeasure, the intrusion detection system (IDS),
which safeguards the integrity and availability of key assets, has always
been a hot research topic in the computer and network security community. In
contrast to host-based IDSs, which are deployed on end users’ systems, a network
intrusion detection system (NIDS) is primarily characterized as a solution inside
the data transfer pipeline between computers that can monitor the network traffic
and alert or even take active response measures when malicious behavior is
spotted [4]. Apart from some NIDSs designed for specific network environments
[16,19,24], such as Hadoop-based platforms or particular cloud systems, most general
NIDS studies [10,14,33,38] were performed on network intrusion detection
datasets to demonstrate and compare their effectiveness and generalization ability
in a data-driven fashion.

Among several limitations of existing algorithms, data imbalance between
classes, especially the lack of data in rarely seen attack categories, is one of
the most challenging problems. It is also a very common phenomenon in
network intrusion detection datasets, considering the difficulty of data collection
or generation. Benign traffic undoubtedly makes up the majority of Internet data
transfer, not to mention that malicious network activity is by nature
disguised. Since the performance of most traditional ML-based methods declines
significantly when learning from imbalanced data, a large body of
research tries to address this problem with various approaches [5,8,20,28,34,36,38].
Recently, contrastive learning has drawn a lot of attention with impressive
performance improvements [27,35] in computer vision and natural language processing.
Besides supervised contrastive learning, the instance-level discrimination framework
in a self-supervised fashion has also shown promising results in few-shot
classification [21] and was quickly adopted in NIDS research [22].

Inspired by the success of contrastive learning and adversarial learning in 
CV and NLP, in this paper we propose a self-supervised adversarial learning
(SSAL) approach for network intrusion detection. The main contributions of this 
paper are as follows: 


— First, we utilized an adversarial learning approach for NIDS with design 
of netflow-based adversarial examples, which improves robustness on class- 
imbalanced datasets by explicitly suppressing the vulnerability in the repre- 
sentation space and maximizing the similarity between clean examples and 
their adversarial perturbations. 

— We proposed a 2-stage pre-train style self-supervised learning in SSAL that 
leverages instance-level self-supervised contrastive learning and adversarial 
data augmentation to achieve a better representation over limited sample, 
which has not been proposed for NIDS to the best of our knowledge. 

— We conducted an experimental evaluation against existing methods on multiple
datasets including UNSW-NB15 [26] and CIC-IDS-2017/2018 [31], which
shows boosted performance of several machine learning baselines across
different datasets.


2 Related Work 


In this section we summarize the algorithms and research work related to this 
study. 


2.1 Network Intrusion Detection System


Data-driven methods have been developed and deployed for NIDSs for more 
than two decades [9]. In order to achieve an effective NIDS, various methods 
including both machine learning (ML) and deep learning (DL) techniques have 
been proposed by research community. 

Traditional machine learning algorithms such as KNN, PCA, SVM, and tree-based
models have all been adopted for intrusion detection, and are often used as
baselines for a particular improved module. For example, Gao et al. [11] used
classification and regression trees (CARTs) on the NSL-KDD dataset with an ensemble
scheme where multiple trees were trained on adjusted sampling. Karatas et al.
[17] addressed the dataset imbalance problem by reducing the imbalance ratio 
using Synthetic Minority Oversampling Technique (SMOTE), and used different 
ML algorithms as a baseline for cross comparison that shows improved detection 
ability for minority class attacks. 

Recent studies suggest that DL algorithms for NIDSs perform much
better than ML-based methods. RNNs and autoencoders [1] were
identified as the most frequently used models for NIDS in past decades. Regarding
data imbalance, Yu et al. proposed a CNN-based few-shot learning model
to improve the detection reliability for network attack categories with
few samples. Manocchio proposed FlowGAN [23], which utilizes generative
models for data augmentation. However, most DL schemes are more complex
and require more extensive computing resources compared to ML-based methods.


2.2 NIDS Datasets 


High-quality datasets are required to fully evaluate the performance
of various intrusion detection systems. Many datasets have been published
in recent years containing representative network flow data with different kinds
of preprocessing, which are provided mainly in three categories of formats.


Packet Based Data. The most original and commonly used format is packet-based
data captured in pcap format, which contains payload. Early NIDS datasets
do not provide packet-based data because it takes too much storage space.
But datasets published more recently, like CIC-IDS-2017/2018, UNSW-NB15
and LITNET-2020 [7], tend to provide both pcap files and flow-based features
for the benefit of comparison between different NIDS methods.


Flow Based Data. Flow-based data is much more condensed compared to
packet-based data. It aims to describe the behavior of a whole network connection
session by aggregating all packets sharing the same properties within a time window.
Commonly used flow-based formats include NetFlow [6], OpenFlow [25]
and NFStream [2]. CICFlowMeter (formerly known as ISCXFlowMeter [32]) is
another important network flow generator, which transforms pcap files into
more than 80 flow features. It was published by the Canadian Institute for
Cybersecurity and is therefore used by both CIC-IDS-2017 and CIC-IDS-2018.


Other Data. This category covers all datasets that are neither purely packet-based
nor flow-based. For example, the KDD CUP 1999 dataset [18] contains host-based
attributes like the number of failed logins, which can only be obtained above the
network interface. As a consequence, each dataset of this category has its own set of
attributes, and such datasets cannot be unified with each other.


2.3 Contrastive Learning 


Contrastive learning techniques have been widely used in metric learning, such
as triplet loss [30] and contrastive loss [13]. In recent self-supervised
approaches, contrastive learning mostly shares the core idea of minimizing
various kinds of contrastive loss (e.g., NCE [12], InfoNCE [27]) evaluated on pairs
of data augmentations. Typically, augmentations are obtained by data transformation
(e.g., rotation, cropping, color jittering in CV, or masking in NLP), but
using “adversarial augmentations” as challenging training pairs that maximize
the contrastive loss has shown more robustness in a recent study [15].


3 Approach 


In this section, we will explain the main algorithms of our proposed self- 
supervised adversarial learning framework for data imbalance network intrusion 
detection. 


Fig. 1. Preprocessing pipeline from PCAP files to flow-based feature vector


3.1 Data Preprocessing 


To build a comparable cross-dataset evaluation process, we adopt the commonly
used datasets UNSW-NB15, CIC-IDS-2017 and CIC-IDS-2018, as they not only
contain a wide range of attack scenarios but also provide original pcap files that
can be easily processed into a unified feature set. The CIC-IDS-2017 dataset is made
up of 5 days of network traffic with 7 different network attacks, which amounts to
51 GB of data. The benign traffic was generated with a profile system to protect user
privacy. It provides both network traffic (pcap files) and event logs for attack
labels on each machine. The CIC-IDS-2018 dataset is also created by CICFlowMeter
but with both benign and malicious profile systems, and has more than 400 GB of
pcap data over 17 days. UNSW-NB15 was released in 2015 by the Australian Centre
for Cyber Security (ACCS) and contains a total of 100 GB of pcap files, consisting
of 2,218,761 (87.35%) benign flows and 321,283 (12.65%) attack ones.

After obtaining the original PCAP files, we follow the setting from [29] and take
43 extended feature dimensions from the latest NetFlow version 9 flow-record format
[6] for flow-based feature extraction (the full feature set can be obtained from
[29]). NetFlow was proposed by Cisco and has become one of the most commonly
used flow-based formats for recording network traffic. A network flow stream is
an aggregation of a sequence of packets in a continuous session (a TCP connection
by default) with the same source IP, source port, destination IP, destination
port, and transport protocol. The distribution of our processed unified dataset
is shown in Table 1.


Table 1. Distribution of Unified Dataset 


Class          | NF-UNSW-NB15 | CIC-IDS-2018 | CIC-IDS-2017 | Summary  | Ratio
Benign         | 2295222      | 16635567     | 2359087      | 21289876 | 88.29%
Fuzzers        | 22310        | 0            | 0            | 22310    | 0.09%
Analysis       | 2299         | 0            | 0            | 2299     | 0.01%
Backdoor       | 2169         | 0            | 0            | 2169     | 0.01%
DoS            | 5794         | 483999       | 252660       | 742453   | 3.08%
Exploits       | 31551        | 0            | 0            | 31551    | 0.13%
Generic        | 16560        | 0            | 0            | 16560    | 0.07%
Reconnaissance | 12779        | 0            | 0            | 12779    | 0.05%
Shellcode      | 1427         | 0            | 0            | 1427     | 0.01%
Worms          | 164          | 0            | 0            | 164      | 0.00%
BruteForce     | 0            | 120912       | 15994        | 136906   | 0.57%
Bot            | 0            | 143097       | 1966         | 145063   | 0.60%
DDoS           | 0            | 1390270      | 41835        | 1432105  | 5.94%
Infiltration   | 0            | 116361       | 0            | 116361   | 0.48%
Web Attack     | 0            | 3502         | 0            | 3502     | 0.01%
portscan       | 0            | 0            | 158930       | 158930   | 0.66%


Session stream separation can be tricky, since streams obtained from the
five-tuple alone may not be accurate and may contain too many data packets. Inspired
by [37], in addition to following TCP handshake flags, we further segment streams
by a timeout mechanism that cuts idle streams into more pieces with a periodic reset.
The procedure of generating NIDS datasets with a unified feature set is shown in
Fig. 1.
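
The timeout-based segmentation described above can be sketched in a few lines of Python. This is a simplified illustration rather than the authors' pipeline: it groups packets by five-tuple and opens a new flow after an idle gap, and it omits the TCP handshake flags and the periodic reset. The function name `split_flows`, the packet tuple layout, and the 30-second `idle_timeout` are assumptions made for the example.

```python
from collections import defaultdict

def split_flows(packets, idle_timeout=30.0):
    """Group packets by five-tuple and start a new flow after an idle gap.

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port, proto),
    assumed to be sorted by timestamp. idle_timeout is a hypothetical value
    in seconds, not one taken from the paper.
    """
    flows = defaultdict(list)  # five-tuple -> list of flows (lists of timestamps)
    for ts, *key in packets:
        segments = flows[tuple(key)]
        if segments and ts - segments[-1][-1] <= idle_timeout:
            segments[-1].append(ts)   # still within the current flow
        else:
            segments.append([ts])     # first packet or idle gap: open a new flow
    return dict(flows)

packets = [
    (0.0, "10.0.0.1", 1234, "10.0.0.2", 80, "TCP"),
    (1.0, "10.0.0.1", 1234, "10.0.0.2", 80, "TCP"),
    (100.0, "10.0.0.1", 1234, "10.0.0.2", 80, "TCP"),
    (100.5, "10.0.0.3", 5555, "10.0.0.2", 53, "UDP"),
]
flows = split_flows(packets)
```

Here the third TCP packet arrives after a 99-second gap, so the same five-tuple yields two separate flows instead of one long stream.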


3.2 Self-supervised Adversarial Learning


Fig. 2. Self-supervised Adversarial Learning vs. Vanilla Contrastive Learning


In self-supervised contrastive learning (CL), the dataset D = {x_i}_{i=1}^{N} is
unlabeled, and each example x_i from a mini-batch is paired either with a positive
sample x_j obtained by transformations T or with a negative sample x_k, k ≠ i. CL
seeks to learn an invariant representation of x_i by minimizing the distance between
positive samples, defined as:

L_{CL} = -\log \frac{\exp(\mathrm{sim}(x_i, x_j))}{\sum_{k \neq i} \exp(\mathrm{sim}(x_i, x_k))} \qquad (1)

Since Chen et al. demonstrate in SimCLR [3] that a temperature parameter
τ and a non-linear projector g after the backbone network are crucial to the
performance of self-supervised CL, we adopt the SimCLR loss L_SimCLR for the base
setting of SSAL:

L_{SimCLR}(x_i, x_j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \qquad (2)

where h_i = f(x_i), h_j = f(x_j), and z_i = g(h_i), z_j = g(h_j).
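
To make the temperature-scaled contrastive loss concrete, here is a minimal pure-Python sketch for a single positive pair. It assumes cosine similarity for sim(·,·) and a temperature of 0.5, neither of which is stated in the text above, and it works directly on given projections z rather than on a trained backbone and projector.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def simclr_loss(z, i, j, tau=0.5):
    """-log( exp(sim(z_i, z_j)/tau) / sum_{k != i} exp(sim(z_i, z_k)/tau) )."""
    logits = [cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    positive = cosine(z[i], z[j]) / tau
    m = max(logits)  # subtract the maximum to stabilise the log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - positive

# Toy projections: z[1] is close to z[0], z[2] points the opposite way.
z = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.0, 1.0]]
aligned = simclr_loss(z, 0, 1)
misaligned = simclr_loss(z, 0, 2)
```

The loss is lower when the designated positive is the nearby vector than when it is the opposing one, which is the behaviour the objective rewards.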


Adversarial Attack. The design of the positive and negative sampling strategy
is key to the performance of CL models, and the robustness of the model largely
depends on the difficulty of the proposed sample pairs. As opposed to vanilla
contrastive learning, self-supervised adversarial learning leverages adversarial
augmentation to ease the difficulty of hard sample mining. Taking an L∞-norm attack
as an example, the perturbation ε is defined as:


\epsilon = \underset{\|\epsilon\|_{\infty}}{\arg\max} \; L_{SimCLR}(x_i, x_i + \epsilon) \qquad (3)


With perturbations ε within a certain radius that lead to the most diverse
positive pairs, we obtain an adversarial training scheme that alternates between
encouraging the learning algorithm to produce a more invariant representation by
updating the parameter θ and then finding the new perturbation ε′ under the updated
parameter θ′. This pipeline is described in
Fig. 2 (Fig. 3).
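
One common way to approximate such a bounded perturbation is projected sign-gradient ascent, sketched below in pure Python. This illustrates the general technique, not the authors' implementation. It assumes an identity encoder, cosine similarity, a radius of 0.1, and central finite differences in place of backpropagation. It also starts away from ε = 0, because the loss is stationary when a sample is paired with itself.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def loss(anchor, positive, negatives, tau=0.5):
    """Contrastive loss of one positive pair against fixed negatives."""
    pos = cosine(anchor, positive) / tau
    logits = [pos] + [cosine(anchor, n) / tau for n in negatives]
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - pos

def linf_perturbation(x, negatives, radius=0.1, steps=10, h=1e-4):
    """Projected sign-gradient ascent inside the L-infinity ball of given radius."""
    def objective(e):
        return loss(x, [xi + ei for xi, ei in zip(x, e)], negatives)

    eps = [radius / 2 * (-1) ** d for d in range(len(x))]  # non-zero start
    best, best_loss = eps, objective(eps)
    for _ in range(steps):
        signs = []
        for d in range(len(x)):
            up = [e + (h if k == d else 0.0) for k, e in enumerate(eps)]
            down = [e - (h if k == d else 0.0) for k, e in enumerate(eps)]
            g = objective(up) - objective(down)  # central finite difference
            signs.append((g > 0) - (g < 0))
        # Step in the sign direction, then clip back into the ball.
        eps = [max(-radius, min(radius, e + radius / 4 * s)) for e, s in zip(eps, signs)]
        if objective(eps) > best_loss:
            best, best_loss = eps, objective(eps)
    return best

x = [1.0, 0.5, -0.3]
negatives = [[-1.0, 0.2, 0.4], [0.1, 1.0, 0.0]]
eps = linf_perturbation(x, negatives)
clean_loss = loss(x, x, negatives)
adversarial_loss = loss(x, [a + b for a, b in zip(x, eps)], negatives)
```

The returned perturbation stays within the radius in every coordinate and yields a higher contrastive loss than the unperturbed pair, which is what makes it a harder positive sample.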


[Figure 3 diagram. Stage 1, SSAL pre-train: x_i → backbone F with parameter θ → projector G → instance-level contrastive loss. Stage 2, classifier fine-tune: x_i → backbone F with frozen parameter θ → classifier → classification loss.]


Fig. 3. Framework of proposed 2-stage SSAL NIDS training process 


3.3 Classifier Fine-Tune 


With SSAL we can already pre-train the model without any class labels in adver- 
sarial fashion, but without class annotation pre-trained model cannot be directly 
used for class-level classification. 

Therefore we freeze the parameter θ of the pre-trained model f, and replace the
projector head g with a non-linear classifier ψ. The training is conducted
under the standard multi-class single-label setting:


zi = W(f(x:)), fori =1,2,...,N 


Pie = o (zie) = —-—, forc=1,2,...,M (4) 


ei enh 
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with cross entropy loss: 


M 
Lee(Xi; L) =— 5 Yi, c log(pi,c) 
é=1 


(5) 
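Equations (4) and (5) can be checked on a single sample with a few lines of code. The sketch below assumes the classifier output is a plain list of logits and that the label is given as a class index; the function names and numbers are illustrative only.

```python
import math

def softmax(logits):
    """Eq. (4): turn classifier outputs z_i into class probabilities p_i."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Eq. (5) with a one-hot target: only the true class contributes."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 0.5, -1.0]       # hypothetical outputs for M = 3 classes
probs = softmax(logits)
loss_correct = cross_entropy(logits, 0)  # true class has the largest logit
loss_wrong = cross_entropy(logits, 2)    # true class has the smallest logit
print(probs, loss_correct, loss_wrong)
```

Because the backbone is frozen in this stage, only the classifier parameters would receive gradients of this loss.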


The full process of the proposed 2-stage SSAL for NIDS is shown in Algorithm 1.


Algorithm 1: Self-supervised adversarial learning for NIDS

Stage 1. SSAL pre-train
  input : dataset $D = \{x_i\}_{i=1}^{N}$
  output: model $f$
  Initialize model $f$ with parameter $\theta$ and projector $g$
  repeat
    for all $x_i \in$ minibatch $B$ do
      generate $\epsilon = \arg\max_{\|\epsilon\|_\infty} L_{SimCLR}(x_i, x_i + \epsilon)$
      $\theta' = \theta + \nabla_\theta L_{SimCLR}(x_i, x_i + \epsilon)$
    end
  until epoch $N$ is reached or $L < \delta_1$

Stage 2. Classifier fine-tune
  input : labeled dataset $D = \{x_i, l_i\}_{i=1}^{N}$, model $f$ with parameter $\theta$
  output: model $f$ and classifier $\psi$
  Initialize classifier $\psi$ with parameter $\rho$, freeze $\theta$
  repeat
    for all $x_i \in$ minibatch $B$ do
      $\rho' = \rho + \nabla_\rho L_{CE}(x_i, l_i)$
    end
  until epoch $N$ is reached or $L < \delta_2$

4 Experiment Results 


Metric and Implementation. The evaluation compares classifier performance
using several classification metrics. The intrusion detection datasets we
evaluate on contain several attack categories, so they can be treated as either
a binary or a multi-class classification problem. When comparing performance in
the binary classification scenario, the basic terms used in the evaluation are
as follows:

$$\mathrm{Accuracy\ (ACC)} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\mathrm{Detection\ Rate\ (DR)} = \frac{TP}{TP + FN} \quad \text{(a.k.a. Recall)}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP stands for the number of true positive samples, FN for false negatives,
and so forth.
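For concreteness, the four binary metrics can be computed directly from the confusion-matrix counts. The counts in this sketch are made up for illustration and do not come from the experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, detection rate (recall), precision and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    detection_rate = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {"ACC": accuracy, "DR": detection_rate, "Precision": precision, "F1": f1}

# Hypothetical counts: 110 attacks (90 detected), 890 benign flows (10 flagged).
m = binary_metrics(tp=90, fp=10, tn=880, fn=20)
print(m)
```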

For the multi-class classification setting with more detailed attack-type
labels, a weighted average of the above metrics is adopted, with weights given
by the proportion of each label in the dataset. To achieve a fair evaluation,
five cross-validation splits are used and the mean is reported.
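One way to read the weighted average is that each class contributes its own precision and recall, weighted by its share of the true labels. The sketch below follows that reading, which is an assumption about the exact weighting; the labels are toy data.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    total = len(y_true)
    support = Counter(y_true)      # number of true samples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = 0.0
    for label, n in support.items():
        weight = n / total
        recall += weight * correct[label] / n
        if predicted[label]:       # a class that is never predicted adds 0
            precision += weight * correct[label] / predicted[label]
    return precision, recall

y_true = ["benign"] * 6 + ["dos"] * 3 + ["worms"]
y_pred = ["benign"] * 5 + ["dos"] + ["dos"] * 3 + ["benign"]
w_precision, w_recall = weighted_scores(y_true, y_pred)
print(w_precision, w_recall)
```

With this weighting a rare class such as the single "worms" sample barely moves the average, which is why the per-class results are also reported separately.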


Evaluation on Unified Feature Dataset. With the unified feature set built on
the pre-processed UNSW-NB15 and CIC-IDS-2017/2018 datasets mentioned in
Sect. 3.1, we conduct an evaluation across multiple datasets. For comparison, we
implemented a simple MLP and the Extra Trees model from [29] as baseline models.
Table 2 shows that our SSAL method achieves outstanding results on all three
datasets and exceeds previous works in most metrics.


Table 2. Performance on unified dataset

Dataset      | Metric | MLP   | Extra trees [29] | SSAL
NF-UNSW-NB15 | ACC    | 91.02 | 99.73            | 99.71
             | DR     | 79.45 | 97.07            | 97.45
CIC-IDS-2017 | ACC    | 88.42 | 97.46            | 99.57
             | DR     | 82.61 | 96.54            | 97.14
CIC-IDS-2018 | ACC    | 83.13 | 99.35            | 99.89
             | DR     | 76.63 | 97.12            | 98.63
Overall      | ACC    | 88.1  | 97.91            | 99.63
             | DR     | 81.6  | 96.65            | 97.35


Table 3 presents the detailed detection results for the different attack classes
on the merged NIDS dataset. With the same backbone (multi-layer perceptron), the
model with SSAL pre-training performs much better on rarely seen attack data.


Table 3. Detailed performance of different classes on unified dataset (ACC)

Class Name     | MLP   | SSAL  | Class Name   | MLP   | SSAL
Benign         | 81.53 | 99.14 | Shellcode    | 21.73 | 89.13
Fuzzers        | 64.31 | 84.65 | Worms        | 34.86 | 60.91
Analysis       | 46.82 | 82.43 | BruteForce   | 65.69 | 88.54
Backdoor       | 49.73 | 85.77 | Bot          | 61.34 | 81.17
DoS            | 54.34 | 97.63 | DDoS         | 53.52 | 99.76
Exploits       | 59.11 | 93.22 | Infiltration | 54.43 | 73.82
Generic        | 58.27 | 88.61 | Web Attack   | 62.77 | 71.44
Reconnaissance | 32.92 | 91.28 | Portscan     | 46.98 | 85.38
Overall        | 76.63 | 98.63 |              |       |
Further Ablation. To further evaluate the proposed method, we conduct ablation
studies on the SSAL modules with different backbone networks. We first take two
frequently used backbones, MLP and CNN, and combine them with SSAL pre-training
for representation learning. The results in Table 4 show that SSAL effectively
enhances the ability of network intrusion detection systems. As for feature
extraction, Table 5 shows the results of different classifiers when SSAL is used
as a feature extractor. We first pre-train on all unlabeled training data with
SSAL, then freeze the network parameters and use SVM or k-NN as the classifier
to check the representation ability of the SSAL model.


Table 4. Performance with different backbone (ACC)

Model      | UNSW-NB15 | CIC-IDS-2017
MLP        | 87.81     | 88.63
MLP + SSAL | 98.37     | 98.82
CNN        | 91.44     | 90.52
CNN + SSAL | 97.53     | 97.74


Table 5. Performance with different classifier (ACC)

CIC-IDS-2017 | SVM   | k-NN
w/o SSAL     | 85.62 | 81.46
with SSAL    | 96.72 | 94.91


5 Conclusion and Discussions 


In this paper, we tackle the data imbalance problem in network intrusion
detection with adversarial-style data augmentation and self-supervised
contrastive representation learning. More specifically, we propose a
self-supervised adversarial learning approach to enhance representation learning
in deep-learning-based NIDS, which uses an instance-wise attack to yield a
robust model by suppressing its adversarial vulnerability to perturbed samples.
State-of-the-art performance was achieved on commonly used datasets. Experiments
on multiple datasets show that the proposed learning framework improves on the
vanilla DL approach with the same backbones.

In addition to these conclusions, some work remains for the future. Although we
and other researchers have put a lot of effort into data imbalance in network
intrusion detection, gaps still need to be filled before a robust and applicable
NIDS is reached. For instance, in our method the results from different feature
sets show a noticeable performance gap. We believe that further improving the
representation of network flow data with a standard and comprehensive behavior
feature set is key to better data-driven NIDS solutions. We also look forward to
exploring a universal end-to-end approach for more generalized NIDS, which could
greatly reduce the difficulty of system deployment.
Abstract. Econnoisseurs are users who obtain high returns from the Internet at
low cost. Identifying econnoisseurs is of great significance for platforms that
want to reduce unnecessary losses. At present, econnoisseurs are mainly
intercepted by rules. This method fails when a new way of getting the best deal
appears, and it responds with a certain lag. This paper identifies econnoisseurs
among the e-commerce website visitors recorded by the Knownsec Security
Intelligence Brain. First, we find that the precision and recall of Isolation
Forest are better than those of Local Outlier Factor and DBSCAN in econnoisseur
detection. Second, we merge the similar URLs visited by users with
Bi-directional Long Short-Term Memory (BiLSTM) and then use the merged data in
the Isolation Forest model. The improved Isolation Forest model based on BiLSTM
further improves the detection ability. Practical case studies show that this
method has a certain validity and reference value for the detection of
econnoisseurs.


Keywords: Econnoisseur · Isolation Forest · Local Outlier Factor · Anomaly
detection · Bidirectional Long Short Term Memory · User behavior


1 Introduction 


In recent years, traffic dividends on e-commerce platforms have peaked. Each
platform runs marketing activities to win customers and improve user stickiness.
This process has also given birth to econnoisseurs, who exploit vulnerabilities
in platform activities for profit.

According to the Analysis Report on the Application of Digital Financial
Anti-Fraud Technology (2021), released by the Institute of Cloud Computing and
Big Data of the China Academy of Information and Communications Technology and
the ICBC Security Attack and Defense Laboratory, the total loss caused by
black-industry fraud (see Fig. 1) is increasing every year and is expected to
reach 710 billion yuan in 2022, of which econnoisseurs account for a large
proportion. In 2019, Pinduoduo lost tens of millions of yuan to econnoisseurs
within a few hours because of an expired coupon bug on the platform. In 2021, an
incorrect coupon setting at Jingdong Mall was discovered and spread by
econnoisseurs, resulting in a direct loss of nearly 70 million yuan. On the one
hand, econnoisseurs damage the interests of ordinary users; on the other hand,
they greatly reduce the effect of the company's marketing activities, so
governing them is urgent.


© The Author(s) 2022 
Fig. 1. Fraud losses and forecasts as a percentage of GDP


2 Related Work 


Applications of anomaly detection in cyber security focus on APT detection,
intrusion detection and so on.

An APT attack is a hidden and persistent network intrusion process that carries
out advanced persistent threats against specific targets. Bohara [1] compared
various unsupervised algorithms such as K-means for APT detection and found that
they can detect infected hosts. Zhong Yao [2] performed anomaly detection on
traffic log data based on Isolation Forest and found that it has a certain
detection ability against APT attacks and can mark the suspected infected hosts.

An intrusion detection system detects intruders in a network. The detection
methods can be divided into supervised and unsupervised machine learning
algorithms. References [3-6] are mainly based on supervised algorithms such as
Naive Bayes, Bayesian Networks, Hidden Markov Models, and ensemble learning for
intrusion detection. These methods require a large amount of labeled sample
data, but labeled samples are insufficient in many scenarios. In [7-11],
unsupervised algorithms such as K-means clustering, hierarchical clustering and
DBSCAN are used for intrusion detection and show good detection ability.

At present, research on econnoisseur detection is still at the theoretical
stage, and engineering practice is mainly based on traditional threshold setting
or rule-based interception. Yuan Dandan [12] used a community discovery
algorithm to group econnoisseurs with similar characteristics. This method fails
when the characteristics of econnoisseurs change.

The challenges of e-commerce econnoisseur identification are as follows:
1. Rule omission. At present, most platforms intercept econnoisseurs based on
rules, but detections are missed when econnoisseur behavior changes.
2. Insufficient sample labels. New econnoisseurs appear every day, and manual
labeling has a high labor cost. 3. Model detection lag. As econnoisseurs keep
updating their methods, the existing rules and models lag behind, so the model
must be iterated and maintained periodically.
3 Theoretical Basis


Anomaly detection can be divided into supervised and unsupervised anomaly
detection. Since econnoisseurs emerge in real-time and the sample data available
in practice are mainly unlabeled, this paper uses an unsupervised anomaly
detection scheme. Because the applicability of detection methods varies with the
scenario and the data, this paper compares three commonly used anomaly detection
schemes on their ability to identify econnoisseurs in e-commerce website log
data.


3.1 Isolation Forest (IForest)


The Isolation Forest model was first proposed by Zhou Zhihua's team and Fei Tony
Liu of Monash University [13] as an ensemble learning method. It has been used
in industrial anomaly detection thanks to its high accuracy and linear time
complexity. The model rests on two assumptions: 1. Abnormal data differ from
normal data. 2. The proportion of abnormal data is relatively small. Both
assumptions hold for econnoisseur detection. The Isolation Forest algorithm cuts
the data space with random hyperplanes. Each cut divides the data into two
subspaces, and cutting continues until each subspace contains only one sample
point or the tree reaches the given height.

To obtain an Isolation Forest, the individual isolation trees are trained first.
After that, the anomaly score s of each test sample is calculated and compared
with the given threshold to decide whether the sample is abnormal.


Algorithm 1. Isolation Forest training

Input: X - input data, t - number of trees, ψ - subsampling size
Output: a set of t iTrees
Initialize Forest
set tree height limit l = ceiling(log2 ψ)
for i = 1 to t do
    X' = sample(X, ψ), randomly select a subsample of size ψ
    Initialize iTree_i, tree depth e = 0
    while |X'| > 1 and e < l
        randomly select an attribute q from attributes Q
        randomly select a split point p between the max and min values of attribute q in X'
        X_l ← filter(X', q < p)
        X_r ← filter(X', q ≥ p)
        inNode{Left ← iTree(X_l, e + 1, l),
               Right ← iTree(X_r, e + 1, l),
               SplitAtt ← q,
               SplitValue ← p}
    end while
    Forest ← Forest ∪ iTree(X')
end for
return Forest
After training yields an Isolation Forest, the anomaly score of a test sample
can be evaluated on the generated isolation trees. Since an isolation tree has
the same structure as a binary search tree (BST), the average path length of an
isolation tree can be estimated with the corresponding BST result:

$$c(n) = 2H(n - 1) - \frac{2(n - 1)}{n} \tag{1}$$

$$H(n) = \ln(n) + 0.5772156649 \tag{2}$$

Formula (1) is the average path depth of an isolation tree built from n samples.
It is used to normalize the depth of a sample in the isolation trees, so that
the anomaly score of a test sample x is given by Formula (3):

$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \tag{3}$$

$$E(h(x)) = \sum_{i=1}^{t} h_i(x) / t \tag{4}$$
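A short worked example may help to read Formulas (1)-(4). The subsample size n = 256 and the average depth of 4 are assumed values chosen only for illustration.

```latex
\begin{aligned}
H(255) &= \ln(255) + 0.5772156649 \approx 6.1185 \\
c(256) &= 2H(255) - \frac{2 \cdot 255}{256} \approx 12.2370 - 1.9922 \approx 10.245 \\
E(h(x)) = c(256) &\;\Rightarrow\; s(x, 256) = 2^{-1} = 0.5 \\
E(h(x)) = 4 &\;\Rightarrow\; s(x, 256) = 2^{-4/10.245} \approx 0.763
\end{aligned}
```

A sample that is isolated after only a few splits therefore scores well above 0.5, while a sample whose average depth matches the normalizer c(n) scores exactly 0.5.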


Algorithm 2. Calculate the sample anomaly score

Input: x - an instance, Forest - Isolation Forest
Output: anomaly score s
Initialize tree depth h(x) = [ ]
for i = 1 to t do
    extract the i-th isolation tree iTree_i, initialize tree height e = 0
    if iTree_i is an external node
        h_i(x) = e + c(iTree_i.size), c(.) is defined in Equation (1)
    end if
    a ← iTree_i.splitAtt
    if x_a < iTree_i.splitValue
        h_i(x) = PathL(x, iTree_i.left)
    else
        h_i(x) = PathL(x, iTree_i.right)
    end if
end for
calculate anomaly score s(x, n), s(.) is defined in Equation (3)
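Algorithms 1 and 2 can be sketched together in a few dozen lines. The version below is a minimal illustration under stated assumptions: trees are built recursively as nested dictionaries, c(n) is taken as 0 for n ≤ 1, and the data are synthetic (a Gaussian cluster plus one far-away point). None of the names come from the paper or from a library.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Formula (1); taken as 0 for n <= 1 (assumption)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, limit, rng):
    """Recursively isolate rows with random axis-parallel splits."""
    if len(rows) <= 1 or depth >= limit:
        return {"size": len(rows)}                 # external node
    q = rng.randrange(len(rows[0]))                # random attribute
    lo, hi = min(r[q] for r in rows), max(r[q] for r in rows)
    if lo == hi:
        return {"size": len(rows)}                 # cannot split further
    p = rng.uniform(lo, hi)                        # random split value
    return {"att": q, "value": p,
            "left": build_tree([r for r in rows if r[q] < p], depth + 1, limit, rng),
            "right": build_tree([r for r in rows if r[q] >= p], depth + 1, limit, rng)}

def path_length(x, node, depth=0):
    """Depth at which x reaches an external node, plus the c(size) correction."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["att"]] < node["value"] else node["right"]
    return path_length(x, child, depth + 1)

def train_forest(data, n_trees=50, sample_size=64, seed=0):
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    limit = math.ceil(math.log2(sample_size))      # tree height limit
    forest = [build_tree(rng.sample(data, sample_size), 0, limit, rng)
              for _ in range(n_trees)]
    return forest, sample_size

def anomaly_score(x, forest, sample_size):
    """Formula (3): scores near 1 are anomalous, around 0.5 or below are normal."""
    mean_depth = sum(path_length(x, tree) for tree in forest) / len(forest)
    return 2.0 ** (-mean_depth / c(sample_size))

data_rng = random.Random(1)
cluster = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
outlier = [8.0, 8.0]
forest, n = train_forest(cluster + [outlier])
outlier_score = anomaly_score(outlier, forest, n)
inlier_score = anomaly_score([0.0, 0.0], forest, n)
print(outlier_score > inlier_score)
```

In a deployment the score would be compared with a threshold, as described above; the threshold itself is a tuning choice that this sketch leaves open.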
3.2 Local Outlier Factor (LOF)


The Local Outlier Factor [14] is a model that determines whether a sample point
is abnormal based on density. Its core idea is that the density around an
abnormal point is lower than the density around other points. Before introducing
the Local Outlier Factor model, we need some basic concepts.


Definition 1: (k-distance). For point p, sort the distances between p and all
other points in ascending order; the k-th smallest distance is the k-distance of
p. If point o is the k-th nearest point to p, then the distance between them is
the k-distance of p, i.e.

$$k\_distance(p) = d_k(p) = d(p, o) \tag{5}$$

Definition 2: (k-distance neighborhood). Draw a circle with point p as the
center and the k-distance as the radius. The points in this circle form the
k-distance neighborhood of p, i.e.

$$N_k(p) = \{o' \mid d(p, o') \le d_k(p)\} \tag{6}$$

Definition 3: (reachability distance). The reachability distance from point p to
point o is the larger of the k-distance of o and the distance between p and o:

$$reach\_dist_k(p, o) = \max\{d_k(o), d(p, o)\} \tag{7}$$

Definition 4: (local reachability density). The reciprocal of the average
reachability distance from p to the points in its neighborhood is the local
reachability density of point p, defined as

$$lrd_k(p) = 1 \Big/ \frac{\sum_{o \in N_k(p)} reach\_dist_k(p, o)}{|N_k(p)|} \tag{8}$$

Definition 5: (local outlier factor). The mean local reachability density of the
points in the neighborhood of p divided by the local reachability density of p
is the local outlier factor of point p, defined as

$$LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|} \tag{9}$$
Algorithm steps:


1. Calculate the distance between each sample point and all other points, and
sort them from near to far.

2. For each sample point, find the points in its k-distance neighborhood, and
then calculate its LOF score.

3. Given a threshold, if the LOF value of a sample point is higher than the
threshold, the sample point is an abnormal point.

In short, the algorithm estimates the density of each sample from the points in
its k-distance neighborhood. The lower the density relative to its neighbors,
the more likely the sample is abnormal.
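The three steps can be sketched compactly on a toy dataset. The version below is an illustration only: it uses exactly the k nearest points as the neighborhood (ties beyond the k-th point are ignored, a simplification of Definition 2), assumes there are no duplicate points, and all names and data are hypothetical.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i] and its k-distance."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    neighbours = order[:k]
    return neighbours, math.dist(points[i], points[neighbours[-1]])

def lof_scores(points, k=3):
    neigh, kdist = {}, {}
    for i in range(len(points)):
        neigh[i], kdist[i] = k_nearest(points, i, k)

    def reach_dist(p, o):
        # Definition 3: max of the k-distance of o and the distance p-o.
        return max(kdist[o], math.dist(points[p], points[o]))

    # Definition 4: inverse of the mean reachability distance.
    lrd = {p: len(neigh[p]) / sum(reach_dist(p, o) for o in neigh[p])
           for p in neigh}
    # Definition 5: mean ratio of neighbour density to own density.
    return [sum(lrd[o] / lrd[p] for o in neigh[p]) / len(neigh[p])
            for p in range(len(points))]

# A 3 x 3 grid of regular points plus one isolated point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
scores = lof_scores(points, k=3)
print([round(s, 2) for s in scores])
```

Points inside the grid obtain scores close to 1, while the isolated point obtains a much larger score and would exceed any reasonable threshold in step 3.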


3.3 DBSCAN 


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [15] is a
density-based spatial clustering algorithm.
The concepts involved in the algorithm are:


Definition 1: (Core user). For sample point p and a given distance ε, if there
are at least Minpts sample points within the ε-neighborhood of p, then p is a
core user. The density of point p is defined as ρ(p) = |N_ε(p)|, where N_ε(·)
denotes the set of points in its ε-neighborhood. If p is a core user, then

$$\rho(p) = |N_\varepsilon(p)| \ge Minpts \tag{10}$$


Definition 2: (Directly density-reachable). Point p is directly density-reachable
from point q if the following two conditions are satisfied:

i) p is in the ε-neighborhood of q: p ∈ N_ε(q). ii) q is a core user:
|N_ε(q)| ≥ Minpts.


Definition 3: (Density-reachable). If there is a point o from which both p and q
are directly density-reachable, then points p and q are density-reachable.


Definition 4: (Border user). For sample point p, if the number of sample points
within radius ε is smaller than Minpts and p lies in the neighborhood of another
core user, then p is a border user.


Definition 5: (Noise user). If sample point p is neither a core user nor a
border user, it is called a noise user.
Algorithm 3. DBSCAN

Input: P - sample data, ε - scan radius, Minpts - minimum number of points included
Output: sample category set X = {x_1, x_2, ..., x_n}
set sample category x_i = 0, 1 ≤ i ≤ n, number of categories k = 1, h = ∅
while P ≠ ∅
    randomly take a user p_i from P
    if x_i = 0
        if ρ(p_i) = |N_ε(p_i)| < Minpts
            x_i = -1
        else
            x_i = k
            add users in N_ε(p_i) into h
            while h ≠ ∅
                randomly select a user p_j from h
                if x_j = 0 or -1
                    x_j = k
                end if
                if ρ(p_j) = |N_ε(p_j)| ≥ Minpts
                    add users in N_ε(p_j) into h
                end if
            end while
            k = k + 1
        end if
    end if
end while
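A compact executable reading of Algorithm 3 is given below. It is a sketch under a few assumptions: points are visited in index order rather than at random, a point counts as a member of its own ε-neighborhood, and each point is expanded at most once so that the inner loop terminates. The toy coordinates are made up.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: -1 for noise users, 1..k for clusters."""
    labels = [0] * len(points)                    # 0 means "not visited yet"

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    k = 1
    for i in range(len(points)):
        if labels[i] != 0:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                        # provisional noise user
            continue
        labels[i] = k                             # core user starts cluster k
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = k                     # noise becomes a border user
                continue
            if labels[j] != 0:
                continue                          # already assigned
            labels[j] = k
            reachable = neighbours(j)
            if len(reachable) >= min_pts:         # expand only from core users
                queue.extend(reachable)
        k += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # first dense group
          (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
          (5, 5)]                                  # isolated visitor
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

For anomaly detection the points labeled -1 are the candidates of interest, since they belong to no dense group of users.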


4 Research Process 


4.1 Data Sources 


The data come from the real-time streaming log data in the Knownsec Security
Intelligence Brain. The fields in the log data include access time, access IP,
user agent, URL link, website domain name, etc. The log data of three e-commerce
websites were selected, and each combination of IP and user agent was regarded
as an independent visitor of the website.

Log data with abnormal access URLs or very few visits are deleted to reduce
interference with the model. The data of the three e-commerce websites in
November 2021 are sampled to observe their behavior over one month. The website
visit numbers are shown in Table 1:

Table 1. Website visit number.

Website | Daily average visits | Daily average visited users
Web A   | 6,621,059            | 155,070
Web B   | 1,298,633,918        | 29,644,844
Web C   | 51,625,887           | 2,539,750
4.2 Feature Construction


After analyzing the access data of some users, we found that econnoisseurs can
be divided into two categories: 1. Monitoring users, who monitor the
preferential information of commodities in real-time and at low frequency.
2. Activity-type users, who apply for and purchase heavily discounted goods at
high frequency during the activity period.

Based on these characteristics, we constructed three categories of user
features: website visits, website visit time and visits to different websites.
The nine features in total are shown in Table 2.


Table 2. User features

Feature category         | Feature name               | Explanation
Website visit            | Visit PV                   | Total visit number
                         | URL visits                 | Number of visited URLs after removing duplicates
                         | URL concentration          | Visit number of the top three URLs divided by total visit number
Website visit time       | Total visit time           | Time interval between start and end time
                         | Visit periods              | 10 min as a time period; if the interval between adjacent access points is more than 10 min, the number of time periods is increased
                         | Average visit time         | Total visit time divided by visit periods
Different website visits | Current website visit ratio | Current website visit number divided by user total visit number
                         | E-commerce websites ratio  | E-commerce websites visit number divided by user total visit number
                         | E-commerce host ratio      | E-commerce websites host visit number divided by total host visit number
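To make the definitions in Table 2 concrete, the sketch below derives the six single-website features for one visitor from a list of (timestamp, URL) records. The record layout, the function name and the convention that the period count starts at 1 are assumptions made for illustration; the three cross-website ratios are omitted because they need logs from other sites.

```python
from collections import Counter

def visit_features(records, gap_seconds=600):
    """records: list of (unix_time, url) for one visitor (IP + user agent)."""
    times = sorted(t for t, _ in records)
    url_counts = Counter(url for _, url in records)
    visit_pv = len(records)
    top_three = sum(n for _, n in url_counts.most_common(3))
    total_time = times[-1] - times[0]
    # A new period starts whenever two adjacent visits are more than 10 min apart.
    periods = 1 + sum(1 for a, b in zip(times, times[1:]) if b - a > gap_seconds)
    return {
        "visit_pv": visit_pv,
        "url_visits": len(url_counts),
        "url_concentration": top_three / visit_pv,
        "total_visit_time": total_time,
        "visit_periods": periods,
        "average_visit_time": total_time / periods,
    }

# Hypothetical visitor: two bursts of activity separated by about 31 minutes.
records = [(0, "/a"), (60, "/a"), (120, "/b"),
           (2000, "/c"), (2100, "/d"), (2160, "/a")]
features = visit_features(records)
print(features)
```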


4.3 User Behavior 


The user feature data of website C are reduced to two dimensions with t-SNE
(t-distributed Stochastic Neighbor Embedding) for visualization, as shown in
Fig. 2. The user data fall into four categories according to the color of the
points. Some sample data were extracted from the four categories and analyzed.
It was found that econnoisseurs appeared in categories two and four, while
normal users were concentrated in categories one and three.

Fig. 2. User dimension reduction visualization

Fig. 3. Feature density curves of different categories


A feature density map is drawn for the four categories of sample users, as shown
in Fig. 3. It shows that the characteristics of the categories differ, as
follows.


Category 1: random visit users. Characterized by a small number of visits, short
visit time, and little website visit information.

Category 2: monitoring users. Users visit specific web pages over a long time
and infrequently.

Category 3: normal access users. The web pages visited by these users are
scattered across categories, and the time spent visiting web pages is longer
than that of category 2; these are users browsing web information normally.

Category 4: specific page access users. These users make a large number of
visits to the website in a short period of time, and the visited pages are
targeted.


4.4 Result Analysis 


The data of the previous week of the website are used as the training set for
model training, and econnoisseur detection is carried out on the user access
data of the latest day. Updating the training data and retraining daily captures
the recent overall distribution of users, so the model can be adjusted in
real-time. During the Double Eleven period, these e-commerce companies held
promotional activities, which also became a carnival for econnoisseurs. The user
access data on November 11 and November 18 were extracted to compare the
detection effect during the activity period and the non-activity period.

Since the detected user data are unlabeled and the sample size is large, the
data of 3000 users are randomly selected from each website and labeled to check
the model's test performance. Table 3 shows that the detection precision and
recall of the Isolation Forest model are higher than those of the other two
models, showing that the model has better applicability.

After further analysis, it was found that some econnoisseurs were missed because
the number of website URLs they visited differed little from that of normal
users. For example, aa/01 and aa/02 are URLs of the same type, which can be
merged to better distinguish different user access patterns. We therefore merge
URLs of the same type based on the Bidirectional Long Short Term Memory (BiLSTM)
model to reduce their impact on the distribution of URL visits. The detection
effect is shown in Table 4.
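The effect of merging same-type URLs on the URL-visit features can be seen with a much simpler stand-in than BiLSTM. The sketch below uses a plain rule (numeric path segments are replaced by a placeholder); this rule is our own simplification for illustration and is not the BiLSTM-based merging used in the paper.

```python
from collections import Counter

def url_template(url):
    """Replace purely numeric path segments with a placeholder."""
    return "/".join("{num}" if segment.isdigit() else segment
                    for segment in url.split("/"))

# Hypothetical visits of one user; aa/01, aa/02 and aa/03 are the same page type.
visits = ["aa/01", "aa/02", "aa/03", "bb/list", "aa/01"]
raw_counts = Counter(visits)
merged_counts = Counter(url_template(u) for u in visits)
print(len(raw_counts), len(merged_counts), merged_counts["aa/{num}"])
```

Before merging, the user appears to browse four different URLs; after merging, the visits concentrate on a single page type, which moves the "URL visits" and "URL concentration" features towards the targeted-access pattern.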

The average number of econnoisseurs detected from November 1 to November 11 was
taken as the daily average during the activity period. The average number
detected from November 12 to November 30 was taken as the daily average during
the non-activity period. The result is shown in Fig. 4: the number of
econnoisseurs during the activity period is about twice that of the non-activity
period.

It can be concluded from the above:


1. The Isolation Forest model performs better in precision and recall than the
other two models on the different e-commerce websites and in both the activity
and non-activity periods, and gives good results in econnoisseur detection.

2. After merging URLs of the same type with the BiLSTM model, the BiLSTM-IForest
model detects econnoisseurs with a significantly higher recall rate than the
Isolation Forest.

3. The detection ability during the non-activity period is better than that
during the activity period. The analysis shows that some users visited the
preferential products many times at low frequency. This behavior is similar to
that of normal users and was not detected. It can be further optimized later.


Table 3. Model result comparison

Activity            | Model   | Precision          | Recall             | F1-Score
Activity period     | IForest | 0.74 | 0.79 | 0.91 | 0.72 | 0.75 | 0.82 | 0.73 | 0.77 | 0.86
                    | LOF     | 0.68 | 0.75 | 0.84 | 0.57 | 0.74 | 0.7  | 0.62 | 0.74 | 0.76
                    | DBSCAN  | 0.6  | 0.56 | 0.68 | 0.62 | 0.6  | 0.66 | 0.61 | 0.58 | 0.67
Non-activity period | IForest | 0.80 | 0.83 | 0.92 | 0.74 | 0.81 | 0.83 | 0.77 | 0.82 | 0.87
                    | LOF     | 0.72 | 0.77 | 0.85 | 0.6  | 0.78 | 0.74 | 0.65 | 0.77 | 0.79
                    | DBSCAN  | 0.65 | 0.57 | 0.66 | 0.66 | 0.61 | 0.67 | 0.65 | 0.59 | 0.66


Table 4. Comparison of results before and after improvement of Isolation Forest

Activity            | Model          | Precision          | Recall             | F1-Score
Activity period     | IForest        | 0.74 | 0.79 | 0.91 | 0.72 | 0.75 | 0.82 | 0.73 | 0.77 | 0.86
                    | BiLSTM-IForest | 0.75 | 0.82 | 0.92 | 0.77 | 0.81 | 0.87 | 0.76 | 0.81 | 0.89
Non-activity period | IForest        | 0.80 | 0.83 | 0.92 | 0.74 | 0.81 | 0.83 | 0.77 | 0.82 | 0.87
                    | BiLSTM-IForest | 0.81 | 0.85 | 0.94 | 0.8  | 0.83 | 0.89 | 0.80 | 0.84 | 0.91
Fig. 4. Detection amount of econnoisseur users in different periods

5 Conclusion


The rise of e-commerce platforms not only brings convenience to people's lives
but also poses a greater challenge to the platforms' risk control ability. How
to control and reduce the risk of a platform being exploited needs to be solved
urgently. Based on the real log data of website visitors in the Knownsec
Security Intelligence Brain, this paper extracts nine user features to identify
econnoisseurs.

The comparison shows that unsupervised anomaly detection models have a certain
detection ability for econnoisseurs on e-commerce websites. Among them,
Isolation Forest has higher detection precision and recall than the LOF and
DBSCAN models on the three e-commerce websites and is more suitable for current
e-commerce user data.

Further analysis found that users visit URLs of the same type, which can be
merged to better describe their real visit behavior. The detection results show
that econnoisseur detection based on BiLSTM-IForest is further improved.

Traffic from the econnoisseurs identified in this paper can be intercepted in
advance. After that, a risk control model can be built on the actual browsing,
purchasing and other specific behaviors of users to strengthen real-time
prevention and control.

Econnoisseurs and platforms are always competing with each other. As the saying
goes, when virtue rises one foot, vice rises ten. We need to track the attack
methods of econnoisseurs in real-time and, at the same time, use the risk
control platform to further counter them.
Abstract. The Application Program Interface (API) plays an important role as
the channel for data interaction between programs, but the widespread use of
APIs has brought security risks that cannot be ignored. An adversary can
perform various Web attacks, including SQL Injection and Cross-Site Scripting
(XSS), by tampering with the parameters of an API. Efficient detection of
parameter tampering attacks on APIs is critical to ensure that the system runs
as expected and to avoid data leakage and property loss. Previous works rely on
rule-based methods or simple learning-based methods to detect parameter
tampering attacks. However, they ignore the contextual information of the API
tokens and thus perform poorly. In this paper, we propose the Context-based
Malicious Parameter Detection (CMPD) framework to detect parameter tampering
attacks on APIs. We use a neural network language model to learn the
distribution of the parameters, parameter names, and URLs and then use a tree
model to detect malicious queries based on the high-dimensional API embedding.
Experiments show that CMPD outperforms all baselines, including a rule-based
method, Support Vector Machine (SVM), and Autoencoder, on the CSIC 2010 dataset,
with the F1 value reaching 0.971. CMPD also achieves an F1 value of 0.895 when
the training data is reduced to 20% and an F1 value of 0.910 when negative
examples are reduced to 1%.


Keywords: Parameter tampering · API · Language model


1 Introduction 


An API is a set of definitions and protocols that serves as a channel for data
interaction between programs. Modern applications are often developed with many
well-defined interfaces to improve the scalability and compatibility of the
program. The widespread use of APIs has made data access much more convenient,
since different terminals can access the relevant information in a similar way,
but it has also brought security issues that cannot be ignored. Especially in
the modern microservice architecture, each application is subdivided as much as
possible, which makes it harder to detect all the security risks faced by APIs.
Effective detection of API security risks ensures that the system runs in good
condition.

APIs are accessed over the HTTP/HTTPS protocol, so threats to the Web protocol
may extend to APIs, such as SQL Injection, Broken Authentication and Session
Management, and Cross-Site Scripting (XSS). Notably, these attacks are usually
carried out by tampering with the parameters of the API. Therefore, a key idea
for mitigating threats to APIs is to prevent the parameters from being tampered
with. The security community has proposed a variety of approaches to address
the security risks of parameter tampering, the most common of which is
rule-based detection. Such methods are often implemented by a lightweight agent
that checks a Web request for security risks before a server processes it. If a
relevant rule is matched and the request is identified as a security risk, the
request is filtered out so that the server is not affected. Although this
detection method is simple to implement and efficient, its over-reliance on
rules preset by humans leaves it unable to detect unknown new attacks, so in
practice it performs poorly and has a high false-negative rate.

Deep learning models have achieved remarkable success in various natural
language processing (NLP) tasks, and it has been shown that these models can
effectively learn the data distribution, which is difficult for rule-based
detection methods. By learning the data distribution, the detection of
parameter tampering can be seen as a pattern classification problem whose goal
is to distinguish the feature pattern of normal API access from that of
malicious API access. However, deep learning methods need to learn from a large
amount of data, which may be difficult to obtain. Furthermore, normal access
is more common than malicious access, and the ratio of normal accesses to
malicious accesses may be significantly unbalanced. Unbalanced data also makes
it difficult for the model to learn the data distribution, as the model may
focus mainly on the type of data that is more common in the dataset and ignore
the less common one.

In this paper, to detect parameter tampering attacks against APIs and
reduce the influence of unbalanced data, we propose the Context-based Malicious
Parameter Detection (CMPD) framework. CMPD improves the effectiveness of
detecting malicious parameters by learning the distribution of each component
of the API and building the relationship among URLs, parameter names, and
parameters. Experiments show that CMPD outperforms all baselines on the CSIC
2010 dataset, with an F1 value reaching 0.97. CMPD also achieves an F1 value of
0.91 on an unbalanced dataset in which normal access data is 100 times more
numerous than malicious data, and an F1 value of 0.89 when the training data of
the CSIC 2010 dataset is reduced to 20%.

We summarize our main contributions as follows: 


— We propose a semantic extraction and learning module to learn the
relationship among URLs, parameter names, and parameters, which universally
models the parameter distribution of different APIs in one framework.
— We propose the Context-based Malicious Parameter Detection (CMPD)
framework, which can effectively detect parameter tampering attacks against
APIs based on the context information indicated by the distribution of the
parameters.

— Experiments on the CSIC 2010 dataset show that CMPD outperforms other
baselines, including rule-based methods, Support Vector Machine (SVM),
and Autoencoder, and achieves competitive results on unbalanced data and
reduced data.


2 Related Work 


2.1 Vulnerability Detection for APIs 


To detect API vulnerabilities, current methods mainly focus on black-box
testing, which needs to generate a large number of test cases. Much research
relies on crawlers or manual methods to obtain the detection object, parses out
the fuzz domain based on the detection object to generate test cases, and uses
an attack pattern library to perform vulnerability detection [1,3,4]. Various
methods have been proposed to generate test cases. Atlidakis et al. [2] propose
REST-ler, which automatically generates test requests with a random walk
algorithm. Avinash et al. [12] proposed six attack patterns for replay attacks
to automatically generate test cases. Douibi et al. [5] automatically generate
test cases for REST APIs based on Swagger and OpenAPI descriptions. Because
crawler-based API vulnerability detection suffers from low coverage and manual
testing cannot be carried out on a large scale, black-box testing is often
combined with the interface documentation. Yu et al. [16] propose a fuzzing
system for RESTful APIs based on SwaggerHub's development interface, which
improves the effectiveness of fuzz testing by automatically generating and
filtering test cases. Viglianisi et al. [14] generated normal test cases and
malicious test cases based on the interface documentation to test the security
risks of RESTful APIs. Different tools have also been proposed to automatically
scan API vulnerabilities, such as FuzzAPI¹, APIFuzzer², boofuzz³, and Astra⁴.
These tools do not need source code or interface documents but combine manual
and crawler methods to achieve vulnerability detection. When interface
documents are available, tools such as TnT-Fuzzer⁵, 42Crunch⁶, and
OWASP ZAP⁷ can extract detection objects directly from the interface documents
to achieve vulnerability detection with high coverage.


1 https://github.com/Fuzzapi/fuzzapi.

2 https://github.com/KissPeter/APIFuzzer.

3 https://github.com/jtpereyda/boofuzz.

4 https://github.com/flipkart-incubator/Astra.

5 https://github.com/Teebytes/TnT-Fuzzer.

6 https://42crunch.com.

7 https://www.zaproxy.org.

2.2 Parameter Tampering Detection for APIs


To detect parameter tampering attacks against APIs, both rule-based methods
and learning-based methods have been proposed. ModSecurity⁸ is used with the
OWASP ModSecurity Core Rule Set (CRS), which contains a large number of rules
for detecting SQL Injection, Cross-Site Scripting, and HTTP Protocol Violations.
Rieck et al. [11] use n-grams and a similarity measure to generate new
features for anomaly detection. Ingham et al. [6] proposed a Deterministic
Finite Automata (DFA) induction method, which uses a heuristic algorithm to
detect abnormalities. Ma et al. [8] use machine learning methods including Naive
Bayes, Support Vector Machine, and Logistic Regression to learn the distribution
of static features to detect attacks. Nguyen et al. [10] use a feature selection
algorithm to reduce the dimension of the features extracted from traffic, which
reduces the computational complexity of the learning algorithm. Liang et al. [7]
developed an RNN-MLP network to detect malicious accesses, where the RNN
contains LSTM and GRU cells and the MLP follows the RNN. Wang et al. [15]
investigated CNN, LSTM, and their combination for malicious access detection,
and these models outperform the traditional methods.


3 Methodology 


3.1 Parameter Tampering Attacks Against APIs 


API parameter attacks attempt to manipulate parameters transmitted between
the client and server in order to alter application data, such as user passwords
and permissions or product prices and quantities. This type of data is typically
kept in cookies, hidden form fields, or URL query strings and is used to regulate
and enhance the functionality of the program. The attack's success depends on
faults in the integrity and logic validation mechanisms, and exploiting these
faults may lead to further attacks such as cross-site scripting (XSS) and SQL
injection. Parameter tampering is usually limited to a few essential categories
of data: API query parameters, cookies, form fields, and HTTP headers.
Specifically, an API query consists of a basic URL $u$, a group of parameter
names $\{n_i \mid i \in \mathbb{N}\}$, and a group of parameters
$\{p_i \mid i \in \mathbb{N}\}$, where the $i$-th parameter is paired with the
$i$-th parameter name. Suppose the server expects to receive a benign query.
For the target $u$, all possible benign choices of the $i$-th parameter are
denoted as $P_i$, and all possible benign choices of the $i$-th parameter name
are denoted as $N_i$. A benign API query for the target URL $u$ can therefore
be defined as

$$\forall i \in \mathbb{N},\ p_i \in P_i, \quad \text{and} \quad \forall i \in \mathbb{N},\ n_i \in N_i \tag{1}$$

and a parameter tampering attack for the target URL $u$ can thus be defined as

$$\exists i \in \mathbb{N},\ p_i \notin P_i, \quad \text{or} \quad \exists i \in \mathbb{N},\ n_i \notin N_i \tag{2}$$


8 https://github.com/SpiderLabs/ModSecurity.

Intuitively, a parameter tampering attack may happen under the following
conditions:


— The adversary tampers with the parameters of an API, attempting to make the
server process the tampered parameters for malicious purposes such as SQL
injection and XSS attacks.

— The adversary tampers with the parameter names of an API, attempting to make
the server process the tampered parameter names for malicious purposes
such as bypassing verification.

— The adversary tampers with both the parameters and the parameter names of
an API. Even if the tampered parameters and parameter names are benign
values for another API, they are malicious values for the current API.
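
The definitions of a benign query and a tampered query can be made concrete with a small sketch. The code below is only an illustration of the set-membership conditions, not the CMPD method: CMPD learns the benign distribution from data, whereas here the benign sets are written by hand. The table `BENIGN_SPEC`, the path `/shop/login`, and the regular expressions are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical benign specification: for each base URL u, the allowed
# parameter names (the sets N_i), each mapped to a pattern that stands in
# for the set of benign values P_i.
BENIGN_SPEC = {
    "/shop/login": {
        "user": re.compile(r"[A-Za-z0-9_]{1,32}"),
        "page": re.compile(r"[0-9]{1,4}"),
    },
}


def is_benign(query_url: str) -> bool:
    """Return True only if every name and every value is in its benign set."""
    parts = urlsplit(query_url)
    spec = BENIGN_SPEC.get(parts.path)
    if spec is None:
        return False  # unknown base URL: no benign sets are defined for it
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        pattern = spec.get(name)
        if pattern is None:  # the parameter name is outside N_i
            return False
        if not pattern.fullmatch(value):  # the parameter value is outside P_i
            return False
    return True


def is_tampered(query_url: str) -> bool:
    """A query is tampered if at least one name or value leaves its benign set."""
    return not is_benign(query_url)
```

A hand-written specification like this has to be maintained per API, which is the limitation of rule-based detection that motivates learning the relationship among URLs, parameter names, and parameters instead.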


3.2 Semantic Extraction and Learning 


Because parameter tampering attacks have a wide attack surface and a large
scope of tampering, traditional methods cannot judge with a single model
whether a request has been tampered with. Furthermore, because text
information is discrete, traditional methods cannot use the semantic information
it contains. We therefore use the Semantic Extraction and Learning Module
to learn the distribution relationship among the basic URL of the API, the
parameter names, and the parameters, and then map the discrete text information
to a high-dimensional continuous space. The general framework of the semantic
extraction and learning module is shown in Fig. 1.

Fig. 1. Illustration of the semantic extraction and learning module of CMPD.
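Specifically, suppose there is a neural network with $K$ layers, and the weights
and the bias vectors for each layer can be defined as

$$
\begin{aligned}
W^{(1)} &\in \mathbb{R}^{m_1 \times m_0}, & b^{(1)} &\in \mathbb{R}^{m_1 \times 1} \\
W^{(2)} &\in \mathbb{R}^{m_2 \times m_1}, & b^{(2)} &\in \mathbb{R}^{m_2 \times 1} \\
&\ \vdots \\
W^{(K)} &\in \mathbb{R}^{m_K \times m_{K-1}}, & b^{(K)} &\in \mathbb{R}^{m_K \times 1}
\end{aligned}
\tag{3}
$$

where $(m_0, m_1, \cdots, m_K)$ is the number of units in each layer. The
activation functions of the layers are denoted as
$(f^{(1)}, f^{(2)}, \cdots, f^{(K)})$, and thus the output of the $k$-th layer
$Y^{(k)}$ can be defined as

$$
\begin{aligned}
net_i^{(k)} &= \sum_{j=1}^{m_{k-1}} W_{ij}^{(k)} Y_j^{(k-1)} + b_i^{(k)} \quad (1 \le i \le m_k) \\
net^{(k)} &= W^{(k)} Y^{(k-1)} + b^{(k)} \\
net^{(k)} &= \left[ net_1^{(k)}, net_2^{(k)}, \cdots, net_{m_k}^{(k)} \right] \\
Y^{(k)} &= f^{(k)}\left( net^{(k)} \right) = \left[ y_1^{(k)}, y_2^{(k)}, \cdots, y_{m_k}^{(k)} \right]
\end{aligned}
\tag{4}
$$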


We extract the URL $u$, the parameter names $\{n_i \mid i \in \mathbb{N}\}$,
and the parameters $\{p_i \mid i \in \mathbb{N}\}$ from each API query and
arrange them in the order in which they appear in the query as
$(w_t)_{t \in \{1,2,\cdots,M\}}$. We randomly remove a token $w$ from
$(w_t)_{t \in \{1,2,\cdots,M\}}$ and send the remaining tokens into the network
through a look-up layer placed before the first layer. The look-up layer has
parameters of dimension $V \times E$, where $V$ is the size of the vocabulary
and $E$ is the size of the embedding. This layer maps each token to continuous
values. We expect the network to infer the removed token and to output the
probability of the removed token in $Y^{(K)}$. During training, the probability
of the removed token in the output layer, i.e., the $K$-th layer, is maximized.
After training, we use the values in the look-up layer as the embedding of a
token and the average of the token embeddings in an API query as the embedding
of the API. In this high-dimensional continuous space, the representation of
tokens indicates the relationship between tokens, so that subsequent modules
can effectively use the semantic information in the tokens.
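
The data flow around the look-up layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the network training itself is omitted, the URL token is taken to be the request path, each parameter value is treated as a single token, and unseen tokens are mapped to a zero vector. The function names `tokenize`, `masked_example`, and `api_embedding` are hypothetical.

```python
import random
from urllib.parse import parse_qsl, urlsplit


def tokenize(query_url):
    """Arrange an API query as [URL, name_1, value_1, name_2, value_2, ...]."""
    parts = urlsplit(query_url)
    tokens = [parts.path]  # assumption: the path stands for the basic URL
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        tokens.extend([name, value])
    return tokens


def masked_example(tokens, rng):
    """Remove one random token; return (remaining context, removed target)."""
    t = rng.randrange(len(tokens))
    return tokens[:t] + tokens[t + 1:], tokens[t]


def api_embedding(tokens, lookup, dim):
    """Average the look-up vectors of the tokens of one API query."""
    total = [0.0] * dim
    for tok in tokens:
        vec = lookup.get(tok, [0.0] * dim)  # assumption: unseen token -> zeros
        total = [a + b for a, b in zip(total, vec)]
    return [a / len(tokens) for a in total]


# Usage: build one training pair for the removed-token prediction task.
tokens = tokenize("http://localhost:8080/app/login.jsp?user=alice&page=2")
context, target = masked_example(tokens, random.Random(0))
```

In a full model the `lookup` table would be the trained $V \times E$ matrix of the look-up layer; here it is just a dictionary from token to vector so that the averaging step can be shown in isolation.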


3.3 Detection on Parameter Tampering 


To reduce the model's reliance on the amount of data and to enable it to
learn effectively when the positive and negative samples are not balanced, we
additionally classify the API embedding using a decision tree model.

A decision tree model is a tree structure that describes how instances are
classified, and it is composed of nodes and directed edges. Nodes are of two
types: internal nodes and leaf nodes. Internal nodes denote a property
or attribute, while leaf nodes denote a class. Classification begins at the
root node, where a specific feature of the instance is tested; the instance is
then assigned to one of the child nodes according to the test result, each
child node corresponding to one value of the feature. Instances are tested
and allocated recursively until a leaf node is reached. Specifically, suppose
the training data consisting of all the API embeddings is $D$, $A$ is the
feature group, and $C_k$ is the set of samples of class $k$. According to the
values of $A$, the dataset can be separated into $D_1, D_2, \cdots, D_n$.
Denoting the samples that are in $D_i$ and in class $C_k$ as $D_{ik}$, the
entropy of the dataset $D$ can be calculated as


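$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|} \tag{5}$$

and the conditional entropy of $D$ given $A$ is

$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|} \tag{6}$$

and the information gain of $D$ from $A$ is defined as

$$g(D, A) = H(D) - H(D \mid A) \tag{7}$$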


During training, the attribute with the largest information gain rate is
selected as the test attribute each time, and the decision tree is constructed
from top to bottom. The Parameter Tampering Detection Module in CMPD consists
of the trained decision tree.
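
The entropy, conditional entropy, and information gain used to choose a split can be computed directly from label counts. The sketch below is a minimal illustration that assumes a feature with discrete values; an embedding dimension is continuous, so in practice it would first be turned into a discrete test such as a threshold comparison, which is an assumption not spelled out in the text.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(D): entropy of the class labels of a dataset."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature_values, labels):
    """H(D | A): entropy of the labels after splitting D by the values of A."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)  # D_i for each value of A
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D | A)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)


# Usage: a feature that separates the classes perfectly gains one full bit,
# while a feature that is independent of the class gains nothing.
labels = ["benign", "benign", "malicious", "malicious"]
perfect = information_gain(["low", "low", "high", "high"], labels)
useless = information_gain(["low", "high", "low", "high"], labels)
```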


3.4 Context-Based Malicious Parameter Detection Framework 


Based on the previous analysis, we now illustrate the general architecture of the 
proposed Context-based Malicious Parameter Detection (CMPD) framework, 
which is shown in Fig. 2.

Fig. 2. General architecture of CMPD.


The CMPD framework consists of the semantic extraction and learning
module detailed in Sect. 3.2 and the parameter tampering detection module
detailed in Sect. 3.3. We first collect all the API access records in the form
of URL requests and feed all of the collected data into the semantic extraction
and learning module. Each API request is thereby mapped to the high-dimensional
hidden space in the form of an API vector representation, which contains the
context information about the normal and abnormal parameters. We then collect
all the API representations and feed them to the parameter tampering detection
module, which can complete the malicious parameter detection without relying
on balanced and abundant data. The detection module classifies each
API representation as a benign request or an abnormal request by following the
decision tree from the root to a leaf node, as illustrated in Fig. 2.


4 Experiments 


4.1 Metric 


The experiments focus on the identification of parameter tampering
attacks, and the evaluation metrics used in this work are precision,
recall, and F1. These metrics are calculated from the numbers of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN) in the
classification results. TP and TN are the numbers of correctly classified
malicious and legitimate API requests, respectively. FP is the number of normal
API requests misclassified as malicious, while FN is the number of abnormal
requests misclassified as legitimate API requests. The precision is calculated as


$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

the recall is calculated as

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

the F1 value is calculated as

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{10}$$
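
The three metrics follow directly from the confusion counts. The sketch below is a small illustration; the counts in the usage lines are made up, and returning 0.0 for an empty denominator is a convention chosen here, not something the text specifies.

```python
def precision(tp, fp):
    """Fraction of requests flagged as malicious that really are malicious."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of malicious requests that were flagged."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Usage with hypothetical counts: 90 attacks caught, 10 false alarms,
# 30 attacks missed.
p = precision(tp=90, fp=10)   # 0.9
r = recall(tp=90, fn=30)      # 0.75
f1 = f1_score(p, r)
```

The same function can be used to check reported figures: a precision of 1 with a recall of 0.652 gives an F1 of about 0.789.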


4.2 Main Result 


The results of different methods on the HTTP DATASET CSIC 2010 are shown
in Table 1. CRS stands for Core Rule Set, and PL stands for Paranoia Level,
which controls the strictness of ModSecurity's rule checking, with a
larger value indicating greater strictness. As the PL increases, the Precision
of ModSecurity decreases while the Recall increases, and the F1 value stays
below 0.80 at every level, indicating that the traditional rule-based method
has a very limited effect. The SVM algorithm is weaker at detecting parameter
tampering than the traditional method, most likely because it is extremely
dependent on the quality of the features, and the features fail to indicate the
distribution of the data. The Autoencoder-based methods are deep learning
methods and perform better than the traditional methods. The proposed
CMPD outperforms all baselines, including traditional detection methods and
learning-based methods, in terms of F1. CMPD also has balanced precision and
recall, indicating that our method has low false-positive and false-negative
rates.


Table 1. Comparison of Precision, Recall, and F1 score.


Method                                      Precision  Recall  F1
ModSecurity + CRS (PL = 1)                  1          0.652   0.789
ModSecurity + CRS (PL = 2)                  0.936      0.685   0.792
ModSecurity + CRS (PL = 3)                  0.841      0.745   0.79
ModSecurity + CRS (PL = 4)                  0.682      0.791   0.732
One-class SVM + 30 features [10]            0.596      0.587   0.592
Stacked Autoencoder + Isolation Forest [13] 0.803      0.883   0.841
Regularized Deep Autoencoder [9]            0.946      0.946   0.946
CMPD                                        0.971      0.971   0.971


4.3 Further Analysis 


Influence of the Number of Training Data. In practical situations, training
data may be scarce, and the samples of normal and abnormal accesses may be
extremely unbalanced. To explore the effect of our model in such demanding
environments, we conducted experiments in two ways. We first reduce the amount
of training data to illustrate the performance of CMPD when training data is
insufficient. The influence of the amount of training data is shown in Fig. 3.
We find that the classification performance of the model gradually increases
as the training data increases, and the classification results are consistent
with the results in Table 1. When the complete training set is used, CMPD
achieves the best classification performance. Moreover, when the training data
is only 20% of the original dataset, the F1 value still reaches 0.89,
indicating that CMPD is less sensitive to the amount of training data and is
effective even with less data.


Influence of the Ratio Between the Number of Negative Examples
and Positive Examples. Further, we randomly drop samples of malicious
queries from the dataset, and the performance of our method is shown in Fig. 4.
We find that when the number of malicious samples decreases, the classification
performance decreases accordingly, but even when it is reduced to 1%
of the normal samples, the F1 value still reaches 0.91, indicating that
our model is effective even when the normal and malicious samples are extremely
imbalanced.

Fig. 5. Visualization of the parameters and parameter names on which the model
concentrates most.


Visualization of Model Concentration. Further, we collect the parameters 
and the parameter names that have a critical impact on the parameter tampering 
detection extracted by the model, and the visualization of these tokens is shown 
in Fig. 5. It can be seen that parameters such as email, login, and password, which 
are highly relevant to parameter tampering attacks, are correctly extracted and 
are considered to be of high importance, indicating that the model has different 
levels of attention to different parameters and successfully learns the features 
related to parameter tampering. 


Case Study. To show that our method can identify tampering with parameters,
tampering with parameter names, and mismatches between the URL
and its parameters or parameter names, we provide the results of a case study
in Table 2. As illustrated in Table 2, if we tamper with the parameter of
the API “http://localhost:8080/tienda1/publico/entrar.jsp” by appending the
string “%11” to the normal parameter “errorMsg=Credenciales+incorrectas”,
the CMPD framework finds that the parameters of the API have been tampered
with. Similarly, if we tamper with the parameter name (from errorMsg to
errorMsgBAC), CMPD still detects the tampering, which shows
that our method learns the correct relationship between the parameters
and the parameter names. Furthermore, if we change the parameter of
“http://localhost:8080/tienda1/publico/entrar.jsp” to the normal parameter of
another API, “http://localhost:8080/tienda1/publico/vaciar.jsp”, our method
also detects that the parameter and parameter name do not correspond
to the correct URL. This shows that CMPD successfully learns the correspondence
between URLs, parameters, and parameter names, so we do not need
to use different models to detect parameter tampering attacks for different
APIs.


Table 2. Case study on different types of tampering


URL                                               Parameters & parameter names          Tampering type                                             Result
http://localhost:8080/tienda1/publico/entrar.jsp  errorMsg=Credenciales+incorrectas     /                                                          Benign
http://localhost:8080/tienda1/publico/entrar.jsp  errorMsg=Credenciales+incorrectas%11  Tampering on parameter                                     Malicious
http://localhost:8080/tienda1/publico/entrar.jsp  errorMsgBAC=Credenciales+incorrectas  Tampering on parameter name                                Malicious
http://localhost:8080/tienda1/publico/vaciar.jsp  B2=Vaciar+carrito                     /                                                          Benign
http://localhost:8080/tienda1/publico/entrar.jsp  B2=Vaciar+carrito                     Parameter and parameter name do not correspond to the URL  Malicious


5 Conclusion 


APIs are vital for data exchange between programs, but their widespread use
has brought significant security risks. By modifying API parameters, an
adversary can launch Web attacks such as SQL Injection and Cross-Site Scripting
(XSS). API parameter tampering detection is therefore critical to keeping the
system running smoothly. To detect parameter tampering attacks, previous works
relied on rule-based or simple learning-based methods, which ignore the
contextual information of the API tokens and thus perform poorly. In this
paper, we propose a framework for detecting API parameter tampering attacks
called Context-based Malicious Parameter Detection (CMPD). We first learn the
distribution of parameters, parameter names, and URLs using a neural network
language model and then use a tree model to detect malicious queries based on
the high-dimensional API embedding. On the CSIC 2010 dataset, CMPD outperforms
all baselines with an F1 of 0.971.
Abstract. With the development of the Internet, cyber security events occur
frequently, and webpage tampering events account for a high proportion of them.
In response to this phenomenon, this paper constructs a webpage tampering
detection framework, BCR. The text data of the webpage to be detected is
segmented and extracted according to the webpage structure, and text features
are extracted with a BiGRU model that captures context dependence. A CRF then
learns the sequence state labels to recognize named entities, and word vectors
constructed from the extracted named entities are fed into an RCNN
model for tampering detection. The experimental results show that the framework
achieves 95.37% precision, 95.35% recall, and 95.34% F1-Score in webpage
tampering detection, which is better than the Textrank RCNN framework.
In practical application, it also achieved 95.13% precision
and 93.25% recall.


Keywords: Webpage tampering - Named entity recognition - Text
classification - Bidirectional gated recurrent unit network - Conditional random field


1 Introduction 


With the rapid development of the Internet, various cyber security incidents continue to 
occur, among which the proportion of webpage tampering events has always been high. 
How to quickly and accurately locate the tampered content in the webpage and rectify 
it in time is of great significance to reducing the loss of the site. 

At this stage, NLP technology is developing rapidly, text classification
technology is widely applied in various fields, and named entity recognition
technology is becoming more and more mature. This paper uses a named entity
model to extract the named entities of the webpage text segment by segment, and
then combines them with a text classification model to identify the tampered
text.


2 Research Status 


At present, the commonly used webpage tampering detection methods are mainly
based on image recognition and comparison and on rule-based detection. Yan
Yufeng and Shen Yong [1] proposed capturing the original image and a real-time
image of the webpage, detecting the feature point information in the before and
after images with an image processing model, and calculating the similarity of
the webpages from the feature point information to determine whether the
webpage has been tampered with. This method works well for detecting tampering
on webpages with relatively fixed content or a low content update frequency.
However, for webpages with a high content update frequency and rich content,
the model efficiency and detection accuracy suffer. Hongwei R et al. [2]
proposed classifying webpage attributes by principal component analysis and
introducing corresponding rules for each category to judge webpage tampering.
This method is effective and efficient for webpages with a simple structure,
but the recognition accuracy is affected when the webpage attributes are complex
and the rules cannot cover new objects.

Named entity recognition is a popular research direction in NLP, and named
entity recognition models are applied successfully in big data research in many
fields. Early named entity recognition mainly relied on building dictionaries,
which required a lot of labor. After continuous optimization and iteration,
today's named entity recognition models mainly rely on various machine learning
algorithms. In the field of named entity recognition in cyber security, Chiu J
et al. [3] proposed a method that combines BiLSTM and CNN and builds a
dictionary in the neural network to encode some words and then match them; this
method achieves a better F1-Score than other methods on open source datasets.
Fan Xiaoxia et al. [4] proposed a named entity recognition system (DNER) for
darknet market text, built with CBOW-CNN-BiLSTM-CRF on Branwen's open source
darknet market data. The system can significantly improve the recognition of
entity types. Yi F et al. [6] proposed a named entity recognition model based
on regular expressions, an entity dictionary, and CRF combined with feature
templates, which takes the particularity and complexity of security entities
into account and obtained good results.


3 Research Content and Methods 


It can be seen from the above that, for most webpage tampering detection, the
final data carrier is text data, and extracting effective and well-characterized
keywords from the text data plays a decisive role in webpage tampering
detection. Different webpages have different text complexity. Complex text
often contains more noisy text data, and its structure is more complicated than
that of simple text, which greatly affects the extraction of keywords with
effective features. To address the interference of complex text data, this
paper designs and implements a framework that extracts text data segment by
segment according to the structure of webpages, uses a named entity model to
extract named entities, constructs text vectors from them, and feeds the vectors
into a text classification model for webpage tampering detection. The framework
includes a data preprocessing framework, a BiGRU-CRF named entity recognition
model, and an RCNN text classification model.

3.1 Data Sources


The experimental data in this paper comes from the historical webpage tampering
monitoring data in the threat intelligence data of Knownsec Security
Intelligence Brain. The data is HTML text data from five types of websites:
government, universities, hospitals, transportation, and energy. It contains
20,000 untampered webpages and 10,000 tampered webpages. The tampered content
involves pornography, gambling, novels, third-party movie websites, third-party
investment websites, and reactionary information.


3.2 Data Preprocessing 


According to the above content and method, the original data is first extracted
in segments according to the structure of the webpage, and the extracted data is
then manually labeled and filtered for stop words.


Data Extraction. 1) Parse the HTML data. 2) Build a DOM tree. 3) Traverse the DOM 
tree to find the tag where the required text is located. 4) Extract the text data segmented 
based on the webpage structure from the returned HTML data according to the tag. 
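
The extraction steps can be sketched with the HTML parser from the Python standard library. This is a simplified illustration, not the authors' tool: instead of building an explicit DOM tree, it keeps a stack of open tags while streaming through the document, and the sets `TEXT_TAGS`, `SKIP_TAGS`, and `VOID_TAGS` are assumptions about which tags carry the required text.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"title", "h1", "h2", "h3", "p", "li", "a", "td"}  # assumed
SKIP_TAGS = {"script", "style"}                                # assumed
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}       # never closed


class SegmentExtractor(HTMLParser):
    """Collect (tag, text) segments while tracking the open-tag stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or not self.stack:
            return
        if any(t in SKIP_TAGS for t in self.stack):
            return
        if self.stack[-1] in TEXT_TAGS:
            self.segments.append((self.stack[-1], text))


def extract_segments(html):
    """Return the text segments of a page, grouped by their enclosing tag."""
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


# Usage on a small hand-written page.
page = ("<html><head><title>City Hospital</title><script>var x=1;</script>"
        "</head><body><h1>News</h1><p>Clinic opens <br>Monday</p>"
        "<div>ignored</div></body></html>")
segments = extract_segments(page)
```

Keeping the tag with each segment preserves the webpage structure, so that later steps can label and locate text segment by segment.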


Data Labeling


1) Named Entity Labeling
This paper uses the word segmentation tool Jieba to perform word segmentation
and part-of-speech tagging on the text data. Since named entities are derived
from nouns, data labeling is based on the nouns obtained after word
segmentation. According to the tampered content of the webpages, a total of 5
entity types are labeled: PER (person), ORG (company/organization), PLF
(platform), OBJ (special noun), and O (irrelevant word). Each segment
corresponds to one named entity label, which serves as the data basis for
subsequent model building.
2) Text Classification Labeling
According to whether it has been tampered with, the text category is labeled
as 0 (not tampered) or 1 (tampered).
3) Label the page to which the text belongs
The domain name of each webpage is used as the source label of the segmented
text data to facilitate subsequent positioning.


Stop Word Filtering. Build a stop word database, including: webpage navigation
vocabulary, website copyright statement vocabulary, common auxiliary words, special
symbols, etc.

3.3 Text Vectorization


Word2vec is used to build text vectors. It offers two models, CBOW and
Skip-gram. The CBOW model predicts the central word from the context of the
input text, and the Skip-gram model predicts the context from the central word.
Based on the research background, this paper adopts the CBOW
model to construct text vectors.


3.4 BiGRU Model 


In the field of named entity recognition, the LSTM model is widely used. In the
LSTM model, a single module contains three gate units: the input gate, the
forget gate, and the output gate. The input gate determines which information
to retain, the forget gate determines which information to discard, and the
output gate produces the final result. In the GRU network, the three gating
units of the LSTM model are replaced by an update gate and a reset gate. The
update gate determines how much past information is carried forward, and the
reset gate determines how much past information is forgotten. The reduction in
gating units also reduces the number of parameters in the network, making GRU
more concise and efficient than LSTM. BiGRU is a neural network model composed
of two unidirectional GRUs running in opposite directions. The current hidden
layer state of BiGRU is jointly determined by the current input $X_t$, the
forward hidden layer state $\overrightarrow{h}_{t-1}$ at time $t-1$, and the
backward hidden layer state $\overleftarrow{h}_{t-1}$ at time $t-1$. The state
of the hidden layer at time $t$ is:

$$\overrightarrow{h}_t = G(X_t, \overrightarrow{h}_{t-1}) \tag{1}$$

$$\overleftarrow{h}_t = G(X_t, \overleftarrow{h}_{t-1}) \tag{2}$$

$$h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t \tag{3}$$

The function $G(\cdot)$ is a nonlinear transformation of the input word vector
that encodes the word vector at this moment into the corresponding hidden layer
state; $w_t$ and $v_t$ respectively represent the weights corresponding to
$\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ at time $t$, and $b_t$
represents the corresponding bias. Its structure is shown in Fig. 1.
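
A minimal sketch of the bidirectional combination is given below. It is an illustration under simplifying assumptions rather than the model used in the paper: inputs and hidden states are scalars instead of vectors, gate biases are omitted, the combination weights are constant over time, and the update rule follows one common GRU convention. For the backward GRU, the previous step is the next position in the text, so the sequence is reversed before it is processed. All parameter names are hypothetical.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def gru_step(x, h_prev, p):
    """One scalar GRU step G(x_t, h_prev) with parameter dict p."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)          # reset gate
    h_cand = tanh(p["wh"] * x + p["uh"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand


def run_gru(xs, p):
    """Run a unidirectional GRU over xs, starting from a zero state."""
    h, states = 0.0, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(xs, p_fwd, p_bwd, w=0.5, v=0.5, b=0.0):
    """Combine forward and backward states as h_t = w*fwd_t + v*bwd_t + b."""
    fwd = run_gru(xs, p_fwd)
    bwd = run_gru(xs[::-1], p_bwd)[::-1]  # re-align backward states to text order
    return [w * f + v * g + b for f, g in zip(fwd, bwd)]


# Usage with made-up parameters and a three-step input sequence.
params = {"wz": 0.5, "uz": 0.3, "wr": 0.4, "ur": 0.2, "wh": 1.0, "uh": 0.7}
hidden = bigru([0.1, 0.8, 0.1], params, params)
```

Each combined state therefore depends on the whole sequence: the forward part summarizes the tokens up to position t and the backward part summarizes the tokens from position t onward.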


3.5 CRF Model 


The Conditional Random Field (CRF) model is a special Markov random field. It is
assumed that there are only observation values $X$ and state values $Y$ in the
model. In the CRF model, each state value $Y_i$ is related only to its adjacent
state values, while the observation values $X$ are not required to have the
Markov property. The CRF model needs to consider the correlation between the
output state values, and feature functions can be used to learn
the relationship between states. The CRF outputs a score for each sequence and
normalizes all sequence scores to find the path with the highest probability as
the predicted sequence. The CRF model includes the state feature function
$\theta$ and the state transition function $\mu$.

Fig. 1. BiGRU model structure diagram


State Feature Function. Related only to the current node; $\lambda$ denotes the weight of
the feature function, that is: $\lambda\,\theta(Y_i, X_i)$.


State Transition Function. Related to both node $i+1$ and node $i-1$; $\omega$ denotes
the weight of the transition function, that is: $\omega\,\mu(Y_{i+1}, Y_{i-1}, Y_i, X_i)$.

Suppose there are state feature functions $\theta_1, \theta_2, \ldots, \theta_L$ whose
weights are $\lambda_1, \lambda_2, \ldots, \lambda_L$, and state transition functions
$\mu_1, \mu_2, \ldots, \mu_K$ whose weights are $\omega_1, \omega_2, \ldots, \omega_K$.
For the sequence $X = \{X_1, X_2, \ldots, X_n\}$, the probability of the output sequence
Y can be calculated as:


$$P(Y|X) = \frac{1}{Z(X)} \exp\Big( \sum_{i}\sum_{l=1}^{L} \lambda_l\,\theta_l(Y_i, X_i) + \sum_{i}\sum_{k=1}^{K} \omega_k\,\mu_k(Y_{i+1}, Y_{i-1}, Y_i, X_i) \Big) \qquad (4)$$


of which:

$$Z(X) = \sum_{Y} \exp\Big( \sum_{i}\sum_{l=1}^{L} \lambda_l\,\theta_l(Y_i, X_i) + \sum_{i}\sum_{k=1}^{K} \omega_k\,\mu_k(Y_{i+1}, Y_{i-1}, Y_i, X_i) \Big) \qquad (5)$$


Z(X) is the normalization factor, which can be seen as the sum of the scores of all
output sequences.

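When the transition features and state features are written in a unified form, with
feature functions $f_i$ and weights $s_i$, the probability of the output sequence Y is:


$$P(Y|X) = \frac{1}{Z(X)} \exp\Big( \sum_{i} s_i f_i(Y, X) \Big) \qquad (6)$$


of which:

$$Z(X) = \sum_{Y} \exp\Big( \sum_{i} s_i f_i(Y, X) \Big) \qquad (7)$$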
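The normalization in Eqs. (6) and (7) can be checked on a toy example by enumerating every label sequence. The sketch below assumes a simplified linear-chain form in which the weighted state features are pre-summed into per-position emission scores and the weighted transition features into a label-to-label matrix. All scores are hypothetical, and the brute-force enumeration is only practical for very short sequences.

```python
import math
from itertools import product


def sequence_score(labels, emissions, transitions):
    """Sum of weighted features for one label sequence (the exponent in Eq. 6)."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def crf_probabilities(emissions, transitions, n_labels):
    """Return P(Y|X) for every label sequence Y by explicit normalization (Eq. 7)."""
    n = len(emissions)
    scores = {y: math.exp(sequence_score(y, emissions, transitions))
              for y in product(range(n_labels), repeat=n)}
    z = sum(scores.values())  # Z(X): sum over all output sequences
    return {y: s / z for y, s in scores.items()}


# Hypothetical scores: 3 positions, 2 labels.
emissions = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
transitions = [[0.5, -0.5], [-0.5, 0.5]]
probs = crf_probabilities(emissions, transitions, n_labels=2)
best = max(probs, key=probs.get)
```

In practice Z(X) is computed with the forward algorithm and the best path with Viterbi decoding. The enumeration here only serves to show what those algorithms compute.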


When the CRF model is used for named entity recognition, its graph structure is
shown in Fig. 2:

Fig. 2. CRF model structure diagram


3.6 RCNN Model 


The RCNN model is a commonly used text classification model, and its structure is
divided into three parts.


Region-CNN Model. A bidirectional RNN model is used to obtain the context
information of each word embedding, and its expression is:


$$c_l(w_i) = f\big(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})\big) \qquad (8)$$


$$c_r(w_i) = f\big(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})\big) \qquad (9)$$


of which:

$c_l(w_i)$ represents the left context of the word $w_i$.

$c_r(w_i)$ represents the right context of the word $w_i$.

$e(w_i)$ represents the embedding vector of the word $w_i$.

$W^{(l)}$ and $W^{(r)}$ are weight matrices, which transfer the left or right context of
the previous word to the left or right context of the next word.

$W^{(sl)}$ and $W^{(sr)}$ are feature matrices, which combine the semantic features of
the current word with the left and right context of the next word.


Computing Hidden Semantic Vectors. The context information obtained in the previous
step is merged with the word embedding to form the expanded word embedding, and the
activation function is used to calculate the hidden semantic feature vector of the word
$w_i$. The expanded word embedding is:


$$x_i = [c_l(w_i);\ e(w_i);\ c_r(w_i)] \qquad (10)$$

The hidden semantic vector is:


$$y_i^{(2)} = \tanh\big(W^{(2)} x_i + b^{(2)}\big) \qquad (11)$$
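The recurrences in Eqs. (8) to (10) can be sketched in a few lines of Python. For brevity the sketch replaces each weight matrix with a scalar, uses tanh as the function f, and initializes the boundary contexts to zero vectors. These are simplifying assumptions for illustration only, and the embeddings are made up.

```python
import math


def step(context, embedding, w_ctx, w_emb):
    """f(W * c + W_s * e), with tanh as f and scalar weights standing in for the matrices."""
    return [math.tanh(w_ctx * c + w_emb * e) for c, e in zip(context, embedding)]


def rcnn_features(embeddings, w_l=0.5, w_sl=0.5, w_r=0.5, w_sr=0.5):
    """Return x_i = [c_l(w_i); e(w_i); c_r(w_i)] for every word (Eq. 10)."""
    n, dim = len(embeddings), len(embeddings[0])
    left = [[0.0] * dim for _ in range(n)]   # c_l of the first word: zero vector (assumption)
    right = [[0.0] * dim for _ in range(n)]  # c_r of the last word: zero vector (assumption)
    for i in range(1, n):                    # Eq. (8): scan left to right
        left[i] = step(left[i - 1], embeddings[i - 1], w_l, w_sl)
    for i in range(n - 2, -1, -1):           # Eq. (9): scan right to left
        right[i] = step(right[i + 1], embeddings[i + 1], w_r, w_sr)
    return [left[i] + embeddings[i] + right[i] for i in range(n)]


# Hypothetical 2-dimensional embeddings for a 3-word text.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = rcnn_features(embeddings)
```

Each expanded vector is three times the embedding size, because it concatenates the left context, the embedding itself, and the right context.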


Continuous Learning, Output Results. After further learning through the TextCNN,
max-pooling, and fully connected layers, the classification result is obtained.
The structure of the RCNN model is shown in Fig. 3:


Fig. 3. RCNN model structure diagram


4 Experiment and Result Analysis 


4.1 Experimental Environment and Evaluation Indicators 


This experiment was performed in the configuration listed in Table 1.

In this experiment, both the named entity model and the text classification model
use precision (PRE), recall (REC), and the comprehensive evaluation measure
(F1-Score) as the accuracy evaluation indicators.
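For reference, the three indicators can be computed as in the minimal sketch below, using the standard definitions based on true positives (tp), false positives (fp), and false negatives (fn). The function names are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: PRE = tp/(tp+fp), REC = tp/(tp+fn), F1 = harmonic mean of both."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from(precision, recall):
    """F1-Score from already-computed precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check against the BiGRU-CRF figures reported in Sect. 4.3
# (PRE 93.88%, REC 91.36%): the harmonic mean is about 92.60%.
f1_bigru_crf = f1_from(0.9388, 0.9136)
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well by maximizing precision or recall alone.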
4.2 Experimental Configuration


Named Entity Recognition. The 30,000 preprocessed data items are divided into a
training set, a test set, and a validation set at a ratio of 6:2:2. The distribution
of the data set is shown in Table 2.
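A 6:2:2 split of this kind can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions made for reproducibility of the example. The source does not state how the items were assigned to the three sets.

```python
import random


def split_dataset(items, ratios=(6, 2, 2), seed=42):
    """Shuffle the items and split them into training, test, and validation sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded shuffle (assumption, for reproducibility)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


# 30,000 placeholder item IDs stand in for the preprocessed data.
train, test, validation = split_dataset(range(30000))
```

With 30,000 items this yields 18,000, 6,000, and 6,000 items, matching the quantities listed for the data set.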

To verify the effectiveness of the framework proposed in this paper, the BiGRU-CRF
model, the BiLSTM-CRF model, and the CNN-LSTM model are set up for comparison.
The structures of the three models are shown in Table 3:


Table 1. Configuration table.

Software and hardware | Configuration
CPU                   | i7-6700HQ @ 2.6 GHz
GPU                   | GTX 970M
Memory                | 16 GB
Operating System      | Deepin 20.5 GNU/Linux


Table 2. Named entity dataset partitioning.

Data set       | Quantity (items)
Training set   | 18000
Test set       | 6000
Validation set | 6000


Table 3. Structure of the named entity comparison models.

       | BiGRU-CRF | BiLSTM-CRF | CNN-LSTM
Layer1 | Input     | Input      | Input
Layer2 | Embedding | Embedding  | Embedding
Layer3 | bgru      | blstm      | conv
Layer4 | dense     | dense      | lstm
Layer5 | crf_dense | crf_dense  | dropout
Layer6 | crf       | crf        | time_distributed
Layer7 | -         | -          | activation


The main parameter configuration of each model is shown in Table 4:


Text Categorization. The 30,000 preprocessed data items are divided into a training
set, a test set, and a validation set at a ratio of 6:2:2. The distribution of the
data set is shown in Table 5.

Model      | Layer | Epoch | Batch_size | Activation
BiGRU-CRF  | bgru  | 30    | 256        | tanh
BiLSTM-CRF | blstm | 30    | 256        | tanh
CNN-LSTM   | lstm  | 30    | 256        | softmax


Table 5. Text classification dataset partitioning.

Data set       | Quantity (items)
Training set   | 18000
Test set       | 6000
Validation set | 6000

Word vectors are built in two ways and then fed into the RCNN model for comparison:
named entities combined with the RCNN model for classification, and text summarization
combined with the RCNN model for classification. The RCNN model is trained with epoch
set to 30 and batch_size set to 256, and the training process is shown in Table 6:


Table 6. RCNN model training process

Layer             | Output shape    | Activation
input             | (None, 12)      | -
layer_embedding   | (None, 12, 100) | -
layer_conv1d      | (None, 8, 128)  | relu
layer_max_pooling | (None, 128)     | -
layer_dense       | (None, 64)      | relu
dense_1           | (None, 2)       | softmax
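The output shapes in Table 6 can be reproduced with simple shape arithmetic. The sketch below assumes a one-dimensional convolution without padding and with stride 1. Under that assumption, a kernel size of 5 is what turns a length-12 input into a length-8 output. The kernel size is inferred from the shapes and is not stated in the source.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel_size) // stride + 1


def rcnn_output_shapes(seq_len=12, embed_dim=100, filters=128, kernel_size=5,
                       dense_units=64, n_classes=2):
    """Walk through the layers of Table 6; None stands for the batch dimension."""
    conv_len = conv1d_output_length(seq_len, kernel_size)  # kernel_size=5 is inferred
    return [
        ("input", (None, seq_len)),
        ("layer_embedding", (None, seq_len, embed_dim)),
        ("layer_conv1d", (None, conv_len, filters)),
        ("layer_max_pooling", (None, filters)),  # max over the sequence axis
        ("layer_dense", (None, dense_units)),
        ("dense_1", (None, n_classes)),
    ]


shapes = dict(rcnn_output_shapes())
```

The final layer has two outputs with softmax activation, which matches a binary decision between tampered and normal text.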


4.3 Experimental Results and Analysis 


Named Entity Recognition. The accuracy indicators of each model are shown in Fig. 4:

Fig. 4. Accuracy indicators of each named entity model


In terms of recognition accuracy, the PRE, REC, and F1-Score of the BiGRU-CRF
model in this scenario are 93.88%, 91.36%, and 92.60% respectively, which is an
improvement over the other two models. The main reason is that the data set consists
of text segments obtained by splitting webpages according to their structure, and the
BiGRU-CRF model, whose gating units are simplified compared with those of the
BiLSTM-CRF model, works better on such simple text data. Both the BiGRU-CRF model
and the BiLSTM-CRF model can encode text information from front to back and from back
to front, which lets them capture bidirectional semantic dependencies. The CNN-LSTM
model cannot encode text information from back to front and can only capture one-way
semantic dependencies, so its accuracy is lower than that of the other two models.

Figures 5, 6, and 7 show the precision, recall, and F1-Score of each model for each
category of named entity:

Compared with the other two models, the BiGRU-CRF model has obvious advan- 
tages in PLF named entity recognition, and is comparable to the BiLSTM-CRF model 
in other types of named entity recognition. The CNN-LSTM model is far behind the 
other two models in terms of OBJ and PLF named entity recognition. From the compre- 
hensive view of the above radar charts, BiGRU-CRF is relatively better in named entity 
recognition in this scenario. 


Text Categorization. The accuracy evaluation indicators of each model are shown in 
Fig. 8: 

Compared with TextRank-RCNN, BiGRU-CRF-RCNN achieves a certain improvement in
precision, recall, and F1-Score. The main reason is that the BCR framework extracts
keywords representing the text based on the characteristics of the BiGRU-CRF model,
and entities can better represent the domain features and context features of the
current text. In contrast, the TextRank-RCNN framework constructs a network based on
the relationship between locally adjacent nodes when extracting keywords and lacks a
mechanism for proper nouns, so the extracted features are not comprehensive and the
accuracy of tampering identification is relatively poor.

Fig. 5. The precision of each model for each type of named entity recognition

Fig. 6. The recall of each model for each type of named entity recognition

Fig. 7. The F1-Score of each model for each type of named entity recognition

Fig. 8. The accuracy index of each text classification model


4.4 Practical Application 


This framework has been applied in Knownsec Security Intelligence Brain. In testing,
an average of 108,326 webpages are checked every day, and an average of 411 tampered
webpages are identified every day. Manual sampling by the sampling team showed a
precision of 95.13% and a recall of 93.25%.


4.5 Conclusion 


At this stage, named entity recognition and text classification have been widely used
in the field of cyber security, but less so in webpage tampering detection. Therefore,
the BiGRU-CRF-RCNN framework is proposed for webpage tampering detection.
From the experiments and the practical application above, the following conclusions
can be drawn:


Advantages of this Framework. Owing to the structure of its gating units, the
BiGRU-CRF model performs better than the other models in this scenario. In terms of
text classification, the named entities extracted by the named entity model better
reflect the characteristics of the current field. Therefore, in the scenario of this
paper, text classification based on text vectors constructed from named entities
achieves a better effect.


Weaknesses of the Framework. The BiGRU-CRF-RCNN model achieves good results partly
because the industry content of the websites checked in production and in the
experiments has little in common with the tampered content. With regard to model
generalization, if the data coverage is widened and the positive and negative samples
become more closely related, the framework will need to be improved according to its
actual performance.
Abstract. Cyberspace search engines have shown great power in finding entities and
services in the network, which provides new ideas and methods for detecting Bitcoin
nodes. This paper introduces Bitcoin’s P2P network and its nodes, including the
reachable nodes and the unreachable nodes. Then, the results of detecting reachable
nodes with the cyberspace search engines are shown. Next, the author proposes a new
approach to find and verify the unreachable nodes with the cyberspace search engines.
Finally, this paper illustrates the de-anonymization of some Bitcoin nodes with the
cyberspace search engines, which maps some nodes’ IP addresses to real Bitcoin
entities, such as Zeblockchain (a browser website), Microwallet (a wallet website)
and Laurentia Pool (a non-profit pool website).


Keywords: Cyberspace search engines - Bitcoin nodes - Reachable nodes - 
Unreachable nodes - De-anonymization 


1 Introduction 


The cyberspace search engine is a new kind of network tool, which has attracted more
and more attention from network researchers in recent years. It is different from the
traditional Web search engine, such as Google, Baidu and Bing, which takes Web pages
as the retrieval objects. A Web search engine crawls and stores web pages across the
network, extracts and analyzes the pages’ content, and provides keyword retrieval
services for the public. A cyberspace search engine finds the entities and services
in the network by active detecting, obtains the target’s information through protocol
interaction, and presents it in a comprehensive view. At present, the well-known
cyberspace search engines in the industry include Shodan (shodan.io, US), Censys
(censys.io, US), BinaryEdge (www.binaryedge.io, EU), Zoomeye (www.zoomeye.org,
CN), Fofa (offline now, CN), and so on.

The cyberspace search engines commonly maintain a protocol library which contains
a variety of protocols, deploy probes all around the world, scan the whole network
using these protocols, and continuously find open ports and services. There are many
kinds of equipment in the network, including servers, network equipment, terminals,
office facilities, smart home devices, industrial control equipment, webcams,
blockchain entities, etc. Reference [1] made a detailed comparative analysis of the
well-known cyberspace search engines, comparing their supported protocols, detected
equipment, equipment types, detecting capabilities, system structure, probes, etc.

Bitcoin is the most successful electronic crypto-currency in the world. It was
proposed by Nakamoto in 2008 [2] and launched officially in January 2009. Bitcoin has
run stably since then and has become an important means of global finance and payment.
The cyberspace search engines have strong infrastructures which offer powerful
computing, abundant storage, and a large number of detecting records. This is a great
convenience for the analysis of assets and equipment in cyberspace. The well-known
cyberspace search engines, including Shodan, Censys, Zoomeye, Fofa, etc., provide
detecting services for Bitcoin nodes. In this paper, we introduce our work on finding
and analyzing Bitcoin nodes with cyberspace search engines.

The contributions of this paper are as follows: 1) Introduce the results of detecting
the Bitcoin reachable nodes with the cyberspace search engines. 2) Propose a new
approach to find and verify the Bitcoin unreachable nodes with the cyberspace search
engines. 3) Illustrate the de-anonymization of some Bitcoin nodes with the cyberspace
search engines, which maps some nodes’ IP addresses to real Bitcoin entities.


2 Bitcoin Network and Nodes 


The Bitcoin system can be logically divided into the network layer and the
transaction layer, as shown in Fig. 1.


Fig. 1. Bitcoin’s logic structure


The network layer is composed of a large number of Bitcoin nodes. Each node
independently keeps broadcasting IP addresses, verifying transactions, packaging
blocks, and mining. All the transactions are stored in blocks, which are linked to
each other in time order to form a blockchain. All the transactions are published to
all network participants and stored in all nodes of the network. Previous studies
mostly focused on the transaction layer and less on the network layer. The Bitcoin
network is a typical P2P network without a central organization or trust center.
Nodes gain trust in each other through interactions and form the network by
themselves. The Bitcoin network is the foundation of the Bitcoin system.

The Bitcoin network can be divided into the visible part (reachable nodes) and the 
invisible part (unreachable nodes). The reachable nodes can receive incoming connec- 
tions and provide public services for the whole network. Generally, they would store a 
complete copy of the blockchain data. They open fixed ports (often 8333) waiting for 
connections and can be regarded as “Servers” in the network. 

The unreachable nodes do not receive incoming connections from outside and don’t 
provide public services for the whole network. They don’t keep a complete copy of the 
blockchain data and can be regarded as “clients”. The unreachable nodes are generally 
deployed behind a NAT (Network Address Translation) or a firewall and cannot be 
found by active detecting. Bitcoin nodes have different connectivity, service type, and 
topology, which have great impacts on the performance of Bitcoin system. Therefore, it 
is important to detect and study the Bitcoin nodes in depth. 


3 Detecting the Reachable Nodes 


Because the reachable nodes open fixed ports and wait for outside connections, we can
easily detect all the reachable nodes by active detecting. There have been many
studies so far. Joan et al. [3] measured the Bitcoin network from November 2013 to
January 2014, connected to the Bitcoin nodes with bitcoin-sniffer (an open-source
tool) [4], collected 872000 nodes, and analyzed the nodes’ geographical distribution,
stability, propagation delay, etc. Christian Decker et al. [5] in 2013 and Giuseppe
Pappalardo et al. [6] in 2016 measured the Bitcoin network and observed the
propagation delay of blocks and transactions in the network. In the same year, Fadhil
et al. measured the Bitcoin network for a week and collected 6430 stable online nodes
and 313676 client IP addresses [7]. Sehyun Park et al. carried out a comparative
study [8] in 2018, developed a tool named Bitcoin-Node-Scanner, obtained and verified
1 million nodes’ IP addresses within 37 days, and counted the IP types
(IPv6/IPv4/Onion), geographical distribution, port numbers, client versions,
protocol versions, etc.

All the studies above were carried out by individual researchers. The cyberspace
search engines have far more power than single terminals. By scanning the whole
network with probes all over the world, they can connect to all reachable nodes, get
their information, and make a time-based cumulative analysis. Fofa showed 56748
Bitcoin reachable nodes detected from December 2016 to September 2021 [9]. Zoomeye
showed 63504 Bitcoin reachable nodes [10] and displays their information such as IP,
open ports, open services, countries, affiliated enterprises, protocol banners, and
geographic longitude and latitude, as shown in Fig. 2.

It should be noted that the nodes reported in [9] and [10] are only the reachable
nodes detected by the cyberspace search engines. Next, we will propose an approach to
find and verify unreachable nodes with the cyberspace search engines.

Fig. 2. Zoomeye’s page of a reachable node


4 Inferring the Unreachable Nodes 


The unreachable nodes don’t open ports for outside connections, so they cannot be
found by active detecting. Even if we got some addresses of the unreachable nodes, we
could not definitely verify them by active detecting due to the existence of “Churn
Nodes”, which are caused by network delay.

There have been a few studies on the unreachable nodes. Alex et al. proposed a
de-anonymization method for the unreachable nodes [12] in 2014, which set up some
probes connected to all entry nodes. When an unreachable node broadcast a connection
request through the entry nodes, the request would be forwarded to the probes and
recorded. The author believed that there were about 90000 unreachable nodes at that
time. Till et al. simulated the Bitcoin network in 2016 and analyzed the broadcasting
of transactions in the network [13]. It was estimated that the total number was about
16000 then. Liang et al. deployed 102 probes around the world to collect the
connection requests [14] in 2017, and estimated that there were 155000 unreachable
nodes in the whole network. Matthias et al. monitored the “unsolicited” ADDR messages
[15] in 2021 and could identify about 31000 active unreachable nodes every day.
Federico et al. studied in detail the roles and number of unreachable nodes in the
Bitcoin network [16], and proposed an improved transaction broadcasting protocol,
which improved the efficiency and security of the Bitcoin network. Alex et al.
introduced the Bitcoin network based on Tor [17], proposed a man-in-the-middle attack
against Bitcoin, and analyzed the delay caused by unreachable nodes in the attack.
Indra et al. proposed a de-anonymization method for the unreachable nodes [18] by
collecting all GETDATA messages and matching IP addresses with transactions/blocks.
The accuracy of identifying the unreachable nodes is up to 90%.
Next, we propose an approach to infer the unreachable nodes with the cyberspace
search engines. First, we set up a fake client that actively connects to the
reachable nodes and obtains a large number of Bitcoin nodes’ IP addresses through the
GETADDR-ADDR interaction mechanism. Then, we input these addresses into the
cyberspace search engines and obtain all the records the engines return. Finally, by
analyzing the open services and detecting time of each target IP, we can infer the
unreachable nodes. Here we make two judgments.

Judgment 1: If an IP has a record of opening the Bitcoin service with a new timestamp
(within the duration of the detecting cycle), this IP stands for a reachable node.

Judgment 2: If an IP has a record of opening other services (HTTP, SSL, etc.) but not
the Bitcoin service, and the timestamp is relatively new (within the duration of the
detecting cycle), this IP stands for an unreachable node.
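The two judgments can be expressed as a small classification routine. The sketch below is illustrative only: the record format (service name plus last-seen date), the 30-day detecting cycle, and the "unknown" fallback are assumptions, since the source does not specify the cycle length or the search engines’ data format.

```python
from datetime import date, timedelta


def classify_node(records, today, cycle_days=30):
    """Apply Judgment 1 and Judgment 2 to one IP obtained from the Bitcoin network.

    records: list of (service_name, last_seen_date) returned by a search engine.
    cycle_days: assumed length of the detecting cycle (hypothetical value).
    """
    cycle = timedelta(days=cycle_days)
    fresh = {service.lower() for service, seen in records
             if timedelta(0) <= today - seen <= cycle}
    if "bitcoin" in fresh:
        return "reachable"    # Judgment 1: fresh Bitcoin service record
    if fresh:
        return "unreachable"  # Judgment 2: only other services seen recently
    return "unknown"          # no fresh record: neither judgment applies


# Hypothetical query date. The services and detecting date mirror the
# example of 167.172.158.149 discussed in this section.
today = date(2022, 3, 30)
example = [("SSH", date(2022, 3, 24)), ("TCP", date(2022, 3, 24))]
verdict = classify_node(example, today)
```

Note that the input IPs must come from the Bitcoin network itself (for example, from ADDR messages). Otherwise, Judgment 2 would label any host without a Bitcoin service as an unreachable node.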

The correctness of Judgment 1 is obvious. The correctness of Judgment 2 is also easy
to understand: if we can verify that a real IP from the Bitcoin system is opening
other services but is not opening the Bitcoin service, the IP must stand for an
unreachable node. The worldwide probes and around-the-clock scanning of the
cyberspace search engines ensure that the node is not “unreachable” merely because of
“network delay”. In fact, we have run experiments to test Judgment 2, and the
accuracy was up to 95%.
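For readers unfamiliar with the GETADDR-ADDR exchange used in the first step of the approach, the sketch below builds the GETADDR request in the Bitcoin P2P wire format: a 24-byte header (network magic, 12-byte command, payload length, and a checksum taken from the double SHA-256 of the payload) followed by an empty payload. It only constructs the bytes and makes no network connection. A real client must first complete the version/verack handshake, which is omitted here, and the helper name is illustrative.

```python
import hashlib
import struct

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")  # Bitcoin mainnet message start bytes


def build_message(command, payload=b""):
    """Serialize a Bitcoin P2P message: magic | command (12 bytes) | length | checksum | payload."""
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    header = (MAINNET_MAGIC
              + command.encode("ascii").ljust(12, b"\x00")  # null-padded command name
              + struct.pack("<I", len(payload))             # little-endian payload length
              + checksum)
    return header + payload


# GETADDR carries no payload; the peer answers with ADDR messages listing known nodes.
getaddr = build_message("getaddr")
```

The addresses returned in the ADDR replies are the candidate IPs that are then looked up in the cyberspace search engines.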
Fig. 3. Zoomeye’s page of an unreachable node

Here is an example. As shown in Fig. 3, the node “167.172.158.149” is a real IP
address obtained from the Bitcoin system. We input the IP into Zoomeye and check the
returned records. It can be seen that the IP opened the “SSH” and “TCP” services but
didn’t open the “Bitcoin” service, and the detecting timestamp was “2022-03-24”. So
the node “167.172.158.149” is an unreachable node.

To make a ground-truth test, we deployed a Bitcoin probe on Vultr. By checking the
real neighbors using the “peerinfo” command, we found that the node
“167.172.158.149” was its neighbor with the attribute “inbound = true”. The node made
an incoming connection to our probe and was a real unreachable node.
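A check of this kind can be automated by parsing the peer list of the probe. The sketch below assumes the probe runs Bitcoin Core and that the peer list is the JSON output of its `getpeerinfo` RPC, where each peer entry has an `addr` field ("ip:port") and a boolean `inbound` field. The sample data, including the port numbers and the second peer, is made up for illustration.

```python
import json


def inbound_peer_ips(getpeerinfo_json):
    """Return the IPs of peers that connected to us (inbound = true)."""
    ips = set()
    for peer in json.loads(getpeerinfo_json):
        if peer.get("inbound"):
            # "addr" is "ip:port"; rsplit keeps bracketed IPv6 hosts intact.
            host = peer["addr"].rsplit(":", 1)[0]
            ips.add(host.strip("[]"))
    return ips


# Hypothetical excerpt of getpeerinfo output (other fields omitted).
sample = """[
  {"addr": "167.172.158.149:51234", "inbound": true},
  {"addr": "203.0.113.7:8333", "inbound": false}
]"""
inbound = inbound_peer_ips(sample)
confirmed = "167.172.158.149" in inbound
```

An inbound peer initiated the connection itself, so seeing a candidate IP among the inbound peers confirms that it runs a Bitcoin client even though it does not accept incoming connections.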


5 De-anonymization of the Bitcoin Nodes 


As an encrypted digital currency, Bitcoin protects the privacy and security of users’
transactions. However, many researchers are very interested in the de-anonymization
of Bitcoin addresses and in tracing the route of transactions. So far, many studies
on this issue have been published. The existing methods are mainly based on the
clustering of transaction addresses. For example, Butian Huang et al. proposed a
clustering algorithm “BPC” [19] based on the nodes’ behaviors, which clusters the
nodes after measuring behavior similarity. The experiment showed that its accuracy
was higher than that of previous algorithms. Annika Baumann et al. analyzed Bitcoin’s
transaction graph [20], inferred that there was a close relationship between network
usage and exchange rate, and de-anonymized the 11 largest entities in the transaction
graph. Meng Shen et al. analyzed the transaction propagation mode, proposed a method
to obtain the initial transaction by calculating a pattern matching score [21], and
established the association between the transaction and the initiating node’s IP. The
experimental accuracy was up to 81.3%.

In the de-anonymization of the Bitcoin nodes, it’s important but difficult to find
the association between a node’s IP and the real network entity (exchanges, browsers,
wallets or pools), because many important entities keep their IP addresses highly
confidential for reasons of privacy. The cyberspace search engines provide new ideas
for associating Bitcoin nodes’ IP addresses with the real entities. The cyberspace
search engines detect the whole network using various protocols and find all the
services supported by a node. For the reachable nodes, all the services such as HTTP,
HTTPS, SSL, and Bitcoin will be found together. As some persons or companies may open
different services on one IP address, we could get extra information about a Bitcoin
node by visiting its HTTP page. In some cases, we could get useful information such
as the geographic location, organization information, and services operated by the
website. Here we give some examples.


1) A node with IP (147.135.252.43): This IP address is a Bitcoin browser
“Zeblockchain”, belonging to Japan Digital Service Company, as shown in Fig. 4.

Fig. 4. Zeblockchain (a Bitcoin browser)

2) A node with IP (216.108.227.39): This IP address is a Bitcoin wallet “Microwallet”,
operated by a US company, as shown in Fig. 5.

Fig. 5. Microwallet (a Bitcoin wallet)

3) A node with IP (51.81.56.49): This IP is a Bitcoin pool “Laurentia Pool”, which is
a non-profit mining pool (open source), as shown in Fig. 6.

Fig. 6. Laurentia (a Bitcoin pool)


Limitations: This method can only de-anonymize Bitcoin websites which open different
services on one IP address. If large organizations have many IP addresses and don’t
deploy different services on the same IP address, this method is no longer
applicable.
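The selection step behind the examples in this section, finding IPs that expose both the Bitcoin service and a web service, can be sketched as follows. The record format and the sample addresses (taken from documentation-only IP ranges) are assumptions for illustration. The actual search engines return richer records.

```python
from collections import defaultdict

WEB_SERVICES = {"http", "https"}


def cohosted_candidates(records):
    """Return the IPs that expose both the Bitcoin service and a web service.

    records: iterable of (ip, service_name) pairs from search-engine results.
    """
    services = defaultdict(set)
    for ip, service in records:
        services[ip].add(service.lower())
    return sorted(ip for ip, found in services.items()
                  if "bitcoin" in found and found & WEB_SERVICES)


# Hypothetical scan records.
records = [
    ("198.51.100.1", "Bitcoin"), ("198.51.100.1", "HTTP"),
    ("198.51.100.2", "Bitcoin"),
    ("198.51.100.3", "HTTPS"),
]
candidates = cohosted_candidates(records)
```

Only the candidates returned here have a web page that can be visited for identifying information, which is exactly the limitation noted above: a node whose operator keeps web services on a separate IP never appears in the list.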


6 Summary 


This paper introduces the working principle of the cyberspace search engines and
discusses their application in detecting the Bitcoin nodes. The Bitcoin network is
composed of a visible part (reachable nodes) and an invisible part (unreachable
nodes), which have different characteristics. The reachable nodes provide public
services for the network and are easy to detect, while the unreachable nodes are only
clients and are hidden in the network. The number of unreachable nodes is about ten
times that of the reachable nodes [14], and they are not easy to detect and analyze.

The author introduces the results of detecting Bitcoin nodes with the cyberspace
search engines, then proposes a new approach to verify the Bitcoin unreachable nodes,
and finally illustrates the de-anonymization of Bitcoin nodes, which finds the
association between a node’s IP and the real network entity (an exchange, a browser,
a wallet or a pool). So far, the cyberspace search engines can only detect Bitcoin
nodes with IPv4 addresses, and IPv6 addresses are not supported. However, with the
fast improvement of the cyberspace search engines, they will play a more important
role in detecting and analyzing the Bitcoin network.
Abstract. Traditional anti-anonymity technologies for Bitcoin transactions fall into
two types. One is network-layer anti-anonymity technology, which locates the
originating IP of a specific transaction by inferring the propagation path of the
transaction. The other is transaction-layer anti-anonymity technology, which analyzes
the data of the Bitcoin ledger to build an on-chain behavior portrait of the user to
whom a specific wallet address belongs. In this work, we propose a new anti-anonymity
technology that constructs transaction behavior vectors and social behavior vectors
based on Bitcoin ledger data and off-chain social data respectively, and builds a
model for mapping and aligning the two vectors. Experimental tests show that the
proposed anti-anonymity technology is more accurate and has better practical effects.
Furthermore, the technology is also suitable for the anti-anonymity of other virtual
currencies.


Keywords: Bitcoin - Virtual currency - Anti-anonymity - Behavior vector 


1 Introduction 


Bitcoin is a purely peer-to-peer version of electronic cash [1], which allows online
payments to be sent directly from one party to another without going through a
financial institution. It relies on digital signatures to prove ownership and a
public history of transactions to prevent double-spending. Bitcoin does not rely on
third-party credit and has strong anonymity, which is reflected in three aspects.
First, the transaction address is anonymous. A Bitcoin transaction address is created
by the user independently of any identity information, and no third party is required
to create or use the address. Second, transaction behavior is fragmented. The Bitcoin
system allows users to generate a different address for each transaction, so a user’s
transaction information can be arbitrarily dispersed across different anonymous
addresses. Third, the source of a Bitcoin transaction packet is difficult to find in
the network. The Bitcoin communication network uses a P2P protocol and has no central
node. Transaction information is broadcast across the whole network, so it is
difficult to track its origin by monitoring a single server. Because of its strong
anonymity, Bitcoin is often used in gambling, illegal fund-raising, fraud, pyramid
selling, money laundering and other illegal activities.

Traditional anti-anonymity technology for Bitcoin transactions mainly includes two
types. One is the network-layer anti-anonymity method, which detects and collects the
transaction information broadcast at the Bitcoin network layer, analyzes the
propagation path of a specific Bitcoin transaction in the P2P network, infers the IP
address of the originating service node of the transaction, and then locates the user
IP of the transaction. The other is the transaction-layer anti-anonymity method,
which obtains user portrait information for a specific wallet address by analyzing
the transaction relationships between different transaction addresses, especially
with the help of the labels of addresses belonging to exchanges, mining pools and
other institutions. Neither type of anti-anonymity technology is effective, because
neither can trace the social identity of the user to whom the transaction address
belongs.

To address the shortcomings of the traditional anti-anonymity technology for Bitcoin
transactions, this paper integrates on-chain and off-chain data and proposes an
anti-anonymity technology for Bitcoin transactions based on a behavior vector mapping
and aligning model. A social behavior vector is built from off-chain social data, and
a mapping and aligning model is established between it and the transaction behavior
vector built from Bitcoin ledger data, which realizes the anti-anonymity of Bitcoin
addresses and transactions. Because the social behavior vector contains the real
social identity information of users, the proposed anti-anonymity technology has
better practical effect than the traditional anti-anonymity technology.


2 Bitcoin Transaction Overview 


Every transaction in the Blockchain has a list of inputs and outputs, where each includes 
addresses that were used in the transaction and the amount of coins spent in that transac- 
tion. Inputs of the current transaction come from the outputs of the previous transaction, 
and the output of the current transaction will be used as the input in other transactions, 
which to form a transaction chain (see Fig. 1). 


[Figure: two transactions in sequence; an output of the first transaction becomes an input of the next]

Fig. 1. Bitcoin transaction chain 
A simple payment has either a single input from a larger previous transaction or multiple
inputs combining smaller amounts, and at most two outputs: one for the payment and
one returning the change, if any, to the sender. The Bitcoin client will automatically
select the change output as an input in future transactions.
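
As a minimal sketch of the chain structure described above, the following Python fragment models each input as a reference to an earlier output. The class names, the field names and the satoshi amounts are illustrative assumptions; they do not reproduce the actual Bitcoin serialization format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TxOutput:
    address: str
    value: int  # amount in satoshi


@dataclass
class TxInput:
    prev_txid: str   # transaction that created the coins being spent
    prev_index: int  # position of that output in the previous transaction


@dataclass
class Transaction:
    txid: str
    inputs: List[TxInput]
    outputs: List[TxOutput]


def input_value(tx: Transaction, ledger: Dict[str, Transaction]) -> int:
    """Sum the previous outputs that the inputs of `tx` spend."""
    return sum(ledger[i.prev_txid].outputs[i.prev_index].value for i in tx.inputs)


def fee(tx: Transaction, ledger: Dict[str, Transaction]) -> int:
    """Whatever the inputs carry beyond the outputs is left as the fee."""
    return input_value(tx, ledger) - sum(o.value for o in tx.outputs)


# Hypothetical example: a mining reward transaction followed by a payment with change.
reward = Transaction("tx0", [], [TxOutput("miner", 5_000_000)])
payment = Transaction(
    "tx1",
    [TxInput("tx0", 0)],
    [TxOutput("payee", 3_000_000), TxOutput("miner_change", 1_990_000)],
)
ledger = {tx.txid: tx for tx in (reward, payment)}
```

In this toy ledger the payment spends the single output of the reward transaction, sends part of it to the payee and returns the rest, minus the fee, to a change address.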

Bitcoin transactions can be roughly divided into two types. The first type is the mining
reward transaction. Each block has one mining reward transaction, which
has no input and only outputs: the system transfers the mining reward of the block and
the fees of the transactions contained in the block to the outputs. The second type is the ordinary
transaction, which has several inputs and several outputs.

The input addresses of a transaction correspond to different private keys,
and spending each input requires a signature from the corresponding private key.
Whoever creates the transaction must therefore control all of these keys, so it is
generally believed that the multiple input addresses of a transaction belong
to the same entity. With the help of transaction address clustering, the scattered
transaction behaviors of the same entity in the ledger can be gathered, which makes it easier
to characterize the behavior of the entity.

There are four kinds of transaction address clustering technology [2]. The first is
clustering based on multiple input addresses: the multiple input addresses of a
transaction belong to the same address cluster. The second is clustering
based on the change address: the change address of a transaction belongs to the same
address cluster as its input addresses. Moreover, with the change address as
the connecting link, the input addresses of two transactions can be merged into
the same address cluster. The third is clustering based on the mining reward
transaction: the multiple output addresses of a mining reward transaction belong to the same
address cluster. The fourth is comprehensive clustering, which combines the
three technologies above.
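
The first two heuristics can be sketched with a union-find structure. This is only an illustration under simplifying assumptions: each transaction is a hypothetical dict holding its input addresses and, where it has already been identified by some other means, its change address. The mining reward heuristic and the detection of the change address itself are left out.

```python
class UnionFind:
    """Disjoint sets of addresses with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_addresses(transactions):
    """Group addresses by the multi-input and change-address heuristics.

    `transactions` is an iterable of dicts with an "inputs" list of addresses
    and an optional "change" address (hypothetical layout).
    """
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            uf.find(addr)  # register every input address
        for addr in inputs[1:]:
            uf.union(inputs[0], addr)  # heuristic 1: co-spent inputs
        change = tx.get("change")
        if change is not None and inputs:
            uf.union(inputs[0], change)  # heuristic 2: change joins the inputs
    clusters = {}
    for addr in list(uf.parent):
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())
```

Because the change address `a3` of one transaction reappears as an input of a later one, a call such as `cluster_addresses([{"inputs": ["a1", "a2"], "change": "a3"}, {"inputs": ["a3", "a4"]}])` merges all four addresses into one cluster, which is the connecting-link effect described above.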


3 Transaction Scene Graph Structure 


Bitcoin transaction scenes include mining reward, deposit and withdrawal on an exchange,
gambling, blackmail, MLM fraud, etc. Among them, deposit and withdrawal of Bitcoin
on an exchange are the most common.

A deposit transaction transfers Bitcoin held at the user’s personal wallet address to the
deposit wallet address assigned to the user by the exchange. The private key of the deposit
wallet address is controlled by the exchange, and different deposit wallet addresses
correspond to different users. Deposit transactions include the customer to customer (C2C)
transaction scene and the business to customer (B2C) transaction scene.

The graph structure of a C2C deposit transaction is generally characterized by a
small number of transaction inputs and two outputs, one of which is the user’s deposit
wallet address, whose cluster label is the name of the exchange (see Fig. 2).




[Figure: C2C deposit transaction scene, with an input, a deposit output and a change output]


Fig. 2. Graph structure of C2C deposit transaction scene 


The graph of the B2C deposit transaction scene has a 1-to-N structure. It is generally
characterized by a small number of transaction input addresses and a large number of transaction
outputs, in which the output addresses are the deposit wallet addresses of a large number
of different users, and the cluster labels of the output addresses may be the same
exchange or different exchanges (see Fig. 3).


Fig. 3. Graph structure of B2C deposit transaction scene


A withdrawal transaction transfers Bitcoin hosted on the exchange to the wallet address
specified by the user. To reduce transaction fees, an exchange usually collects
the withdrawal orders of multiple users and transfers Bitcoin to their wallet addresses
in one transaction.

The graph structure of a withdrawal transaction therefore has a 1-to-N
structure. The cluster labels of the transaction input addresses are the same exchange, and
the transaction output addresses are specified by a large number of different users (see
Fig. 4).

Because each transaction incurs a fee, in practice a deposit transaction and a
withdrawal transaction are often combined, that is, a user withdraws Bitcoin from one exchange and
deposits it directly into another exchange.
[Figure: withdrawal transaction scene, with an exchange input fanning out to withdrawal outputs 1, 2, 3, ... and a change output]

Fig. 4. Graph structure of withdrawal transaction scene 
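
The structural rules of this section can be condensed into a small rule-based classifier. It is a sketch, not the procedure used in this paper: the function name, the cutoffs `few_inputs` and `many_outputs`, and the representation of cluster labels as optional strings are all assumptions made for illustration.

```python
from typing import List, Optional


def classify_scene(
    input_labels: List[Optional[str]],
    output_labels: List[Optional[str]],
    few_inputs: int = 5,     # hypothetical cutoff for "a small number of inputs"
    many_outputs: int = 10,  # hypothetical cutoff for "a large number of outputs"
) -> str:
    """Guess the transaction scene from the cluster labels of inputs and outputs.

    A label is the exchange name of the address cluster, or None if unlabeled.
    """
    labeled_outputs = [label for label in output_labels if label is not None]

    # Withdrawal: every input belongs to one and the same exchange cluster.
    if input_labels and None not in input_labels and len(set(input_labels)) == 1:
        return "withdrawal"

    if len(input_labels) > few_inputs:
        return "unknown"

    # C2C deposit: at most two outputs, exactly one of them an exchange deposit address.
    if len(output_labels) <= 2 and len(labeled_outputs) == 1:
        return "c2c_deposit"

    # B2C deposit: many outputs, all but at most one (the change) labeled as deposit addresses.
    if len(output_labels) >= many_outputs and len(labeled_outputs) >= len(output_labels) - 1:
        return "b2c_deposit"

    return "unknown"
```

Checking the withdrawal rule first reflects the fact that only withdrawals have exchange-labeled inputs; the two deposit scenes are then separated by the number of outputs.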


4 Traditional Bitcoin Anti-anonymity Technology 


Traditional Bitcoin anti-anonymity technology mainly includes network layer
anti-anonymity technology and transaction layer anti-anonymity technology.

Network layer anti-anonymity technology [3] refers to collecting the transaction packets
transmitted over the Bitcoin P2P network, analyzing the propagation path of a specific Bitcoin
transaction packet in the P2P network, and inferring the server IP of the first broadcasting node.
For example, Koshy et al. [4] used special transactions to find the originating node. Most
normal transactions are forwarded once by multiple nodes, while malformed
transactions are forwarded only once, by the originating node. This feature
can therefore be used to identify the originating node of special transactions. However, because
special transactions account for only a small proportion of all transactions, the effect of this
method is limited. In addition, Biryukov et al. [5, 6] proposed a transaction traceability
mechanism based on neighbor nodes, which improves traceability accuracy by using neighbor nodes
as the basis for judgment. However, the scheme needs to continuously send packets to all nodes in
the Bitcoin network, which may seriously interfere with the network.

Network layer anti-anonymity technology can infer the IP of the originating service node
of a transaction only with a certain probability. Gao Feng, Mao Hong-liang and others [3]
achieved anti-anonymity traceability with a recall rate of 60% and an
accuracy rate of 35.3%. Tracing from the service node IP to the
end-user IP additionally requires the operator’s traffic analysis technology and IP
positioning data.

Transaction layer anti-anonymity technology refers to finding the correlation
between different Bitcoin addresses by analyzing the transaction records in the Bitcoin ledger,
so as to infer the behavior patterns and capital flows of a transaction address.
Liao et al. [7] analyzed the extortion process of the ransomware CryptoLocker
using Bitcoin ledger data, found multiple Bitcoin addresses belonging to
the extortion organization, and identified a large number of Bitcoin ransom transactions.
Meiklejohn et al. [8] used heuristic cluster analysis to identify multiple Bitcoin
addresses belonging to the Silk Road website. Guo Wen-sheng et al. [9] studied how
to divide Bitcoin entities into types with different characteristics through
machine learning on Bitcoin ledger data.
Transaction layer anti-anonymity technology can analyze and infer the on-chain
trading behavior characteristics of a specific wallet address. Combined with
the label information of exchanges, mining pools and other platform
institutions, it can infer the ownership of some wallet addresses, but it is difficult to
determine the user’s social identity information. In practice, investigations of many Bitcoin
hacking incidents analyze the transaction data of the Bitcoin ledger, track the exchange into
which the Bitcoin is transferred, and ask the exchange to provide the user information
of the Bitcoin addresses.

In recent years, Bitcoin anti-anonymity technology that integrates
on-chain and off-chain data has gradually become a research hotspot. Husam et al. [10]
identified anonymous services and users on the Tor network by integrating online social network
data and Bitcoin ledger data.


5 Behavior Vectors Mapping and Aligning Model 


Because Bitcoin transaction addresses and the trading process are anonymous and Bitcoin
ledger data is poorly readable, most centralized institutions or platforms, such as
exchanges and mixing services, synchronously record the user identity information
and behavior information corresponding to the Bitcoin ledger data. This data is called
off-chain social data. Although it does not contain Bitcoin addresses, making full use of
it makes it possible to locate and de-anonymize the transaction behavior in the Bitcoin
ledger data.

We define the social behavior vector S with five dimensions: [time, value, scene,
name, account]. Time is the time when the user receives the social data, value is the amount
of Bitcoin stated in the social data, scene is the transaction scene described in the social
information, name is the platform name, and account is the user’s social account. If only time
and value are considered and the transaction scene and platform name are missing or ignored,
the accuracy of anti-anonymity will suffer in some complex cases.

Similarly, we define the transaction behavior vector E with seven
dimensions: [time, value, scene, input label, output label, input address, output address].
Time is the transaction time recorded in the Bitcoin ledger, value is the amount of Bitcoin
in the transaction output, scene is the transaction scene inferred through graph structure
analysis, input label is the clustering label of the transaction input address, output label is
the clustering label of the transaction output address (non-change address), input address
is the transaction input address, and output address is the transaction output address
(non-change address). If the transaction behavior vector E and the social behavior vector S
satisfy the following conditions:


(1) The difference between S.time and E.time is small, that is, the social time is close to the
Bitcoin ledger transaction time, for example less than 10 min;

(2) S.value is equal to E.value, that is, the transaction values on and off the chain are
consistent;

(3) S.scene is equal to E.scene, that is, the transaction scenes on and off the chain are
consistent;

(4) For a deposit transaction, S.name is equal to E.output label, that is, the
platform name is consistent with the address clustering label on the chain.
Then, the user’s social account S.account can be considered to correspond to the Bitcoin
transaction address E.output address. A user’s social account is more unique and more
social than an IP address or a user behavior profile, and therefore better reflects the
user’s social identity information.
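
The four conditions can be written as a single predicate. The sketch below is one possible reading: the class and field names are hypothetical, values are held as `Decimal` so that equality is exact, the scene is reduced to a coarse tag such as "deposit", and the 10 min bound is the example threshold mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SocialVector:
    """S = [time, value, scene, name, account]."""
    time: datetime
    value: Decimal
    scene: str
    name: str
    account: str


@dataclass(frozen=True)
class TransactionVector:
    """E = [time, value, scene, input label, output label, input address, output address]."""
    time: datetime
    value: Decimal
    scene: str
    input_label: Optional[str]
    output_label: Optional[str]
    input_address: str
    output_address: str


def is_aligned(s: SocialVector, e: TransactionVector,
               max_gap: timedelta = timedelta(minutes=10)) -> bool:
    """Return True if S and E satisfy the four alignment conditions."""
    if abs(s.time - e.time) >= max_gap:   # (1) times are close
        return False
    if s.value != e.value:                # (2) values agree
        return False
    if s.scene != e.scene:                # (3) scenes agree
        return False
    if s.scene == "deposit" and s.name != e.output_label:
        return False                      # (4) platform name matches the cluster label
    return True
```

When `is_aligned(s, e)` holds, `s.account` is taken as the social account behind `e.output_address`.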


6 Experiment and Result Analysis 


To verify that the on-chain and off-chain behavior vector mapping and aligning model
can realize the anti-anonymity of Bitcoin transactions more accurately,
we conducted an experiment on the deposit transactions of an exchange platform. The
experimental process is as follows:


(1) Deposit Bitcoin to the two deposit wallet addresses assigned by the exchange, then receive
26 social messages sent by the exchange through two social accounts. The 26 social
messages correspond to 26 social behavior vectors, including 11 social behavior
vectors belonging to social account A and 15 social behavior vectors belonging to
social account B. A sample social behavior vector after anonymization is as
follows: [‘2020-05-12 14:24’, ‘0.010 *’, deposit, * exchange, ‘account’]

(2) Determine the time window of the Bitcoin ledger data. In this experiment, the window
starts 20 min before the time of the social behavior vector and ends 10 min after it.

(3) Extract the time and value fields of each social behavior vector, match the value against
the output values of the Bitcoin ledger transactions in the time window, and choose the
transaction outputs with an equal value.

(4) Analyze the graph structure of each chosen transaction, and keep the transactions whose
transaction scene is the same as the scene of the social behavior vector.

(5) For the transaction output addresses, choose the address whose cluster label is consistent
with the exchange’s name in the social behavior vector.
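
The window, value, scene and label filters of the steps above can be sketched as a filtering funnel that finally groups the surviving output addresses by social account. The record layout (plain dicts with the keys shown, amounts as integer satoshi) and the function names are assumptions for illustration only and do not describe the tooling used in the experiment.

```python
from datetime import datetime, timedelta


def candidate_outputs(social, ledger_outputs,
                      before=timedelta(minutes=20), after=timedelta(minutes=10)):
    """Filter ledger outputs for one social behavior vector."""
    start, end = social["time"] - before, social["time"] + after
    hits = [o for o in ledger_outputs if start <= o["time"] <= end]  # time window
    hits = [o for o in hits if o["value"] == social["value"]]        # equal value
    hits = [o for o in hits if o["scene"] == social["scene"]]        # same scene
    hits = [o for o in hits if o["label"] == social["name"]]         # label matches exchange
    return hits


def addresses_for_accounts(social_vectors, ledger_outputs):
    """Map each social account to the set of output addresses it aligns with."""
    matched = {}
    for social in social_vectors:
        for output in candidate_outputs(social, ledger_outputs):
            matched.setdefault(social["account"], set()).add(output["address"])
    return matched
```

If every social behavior vector of an account aligns with outputs paying the same address, the set for that account has a single element, which corresponds to the "matched address" count reported in Table 1.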


The experimental results are shown in the following table (see Table 1): 


Table 1. Anti-anonymity experimental results of Bitcoin transaction 


Social account | Social behavior vectors | Matched social behavior vectors | Matched address | Deposit address
A              | 11                      | 11                              | 1               | Yes
B              | 15                      | 15                              | 1               | Yes


The eleven social behavior vectors of social account A are respectively aligned with
eleven C2C deposit transaction behavior vectors, all of which
belong to one Bitcoin address, which is also the deposit address opened by the
exchange for user A. The fifteen social behavior vectors of social account B are respectively
aligned with fifteen B2C deposit transaction behavior vectors, all of which
belong to one Bitcoin address, which is also the deposit address
opened by the exchange for user B.


7 Conclusion 


The anti-anonymity technology for Bitcoin transactions proposed in this paper, based on a
behavior vector mapping and aligning model, realizes the fusion analysis of on-chain and
off-chain data. Compared with traditional anti-anonymity technology, it is more effective
in practice. The proposed technology
is also applicable to the anti-anonymity of other virtual currencies, such as Ethereum
and Tether USD.
Abstract. Deepfake videos created by generative models have recently
become a serious societal problem because they are hardly distinguishable
by human eyes, which has attracted considerable academic attention. Previous
research has addressed this problem with various schemes
that extract visual artifacts of non-pristine frames or discrepancies between
real and fake videos, where patch-based approaches have been shown to be
promising but are mostly used for frame-level prediction. In this paper, we
propose a method that leverages comprehensive consistency learning in
both spatial and temporal relations with patch-based feature extraction.
Extensive experiments on multiple datasets demonstrate the effectiveness
and robustness of our approach, which combines all consistency cues
together.


Keywords: Deepfake detection · Digital forensics · Video classification


1 Introduction 


“Seeing is believing” hardly holds true nowadays given the prosperity of computer
science and information technology, especially the massively emerging
applications of artificial intelligence. Although image and video forgery has been
a topic since the beginning of photography, open source applications
represented by Deepfakes [7] have brought this problem to a whole
new level. Face manipulation in visual content has become an effortless task with
the help of deep learning based generative models such as variational autoencoders
(VAEs) [16] and generative adversarial networks (GANs) [12], so that anyone can
produce fake videos with false identities or manipulated expressions and movements
(known as “Deepfake Videos”) in several minutes without expert knowledge.
Some of these tools have already been used to create malicious videos that
violate citizens’ privacy or attack public figures, such as fake pornography and
blackmail material, which may easily lead to catastrophic results with today’s mass media
and social networks. The detection of deepfake videos has become a hot and
urgent issue.

Various methods have been proposed by the academic community in recent years
to effectively recognize this particular type of forged images and videos. Since the
majority of forgery methods share a common image stitching pipeline including face
detection, warping and blending, early research addresses this task by detecting
suspicious artifacts left by the stitching process at the frame level, such as face
warping artifacts [20] and blending boundaries [18]. To yield a result for a whole
video clip from frame-level predictions, such methods usually cascade the frame-level model
with a merge module, or sometimes simply use a weighted average. But ignoring
the dependency among consecutive frames tends to produce a sub-optimal
combination. Frequency-based approaches [9, 25] have also been introduced to
fully utilize temporal relations. Self-consistency is another crucial concept in
image forensics [14, 35], where patch-based and feature-map-based methods have
both shown promising results. Although detection accuracy on datasets has
improved significantly with the different approaches presented, forgery techniques
are also evolving to reduce these artifacts, which forms an ever-changing arms
race.

In this work, we aim to catch both the intra-frame discrepancy introduced during image
stitching and the inherent flaws of inter-frame misalignment for more effective and
robust deepfake detection. Our contributions can be summarized as follows:


– We propose a comprehensive self-consistency learning (CSCL) model to
explore the intrinsic discernible evidence between pristine and deepfake videos
with both spatial and temporal consistency learning.

– To achieve more effective and robust deepfake detection, we also propose the
C³ Loss, namely the comprehensive consistency coordination loss, which tackles
the inherent defects of the deepfake production pipeline, in which videos are created
frame by frame without sequential knowledge.

– Experiments conducted on multiple datasets demonstrate the effectiveness
and robustness of our approach. In particular, the best performance is reported in
cross-dataset and low quality tests.


2 Related Work 


Frame-Level Detection Methods. The emergence of deepfake videos on the
internet has raised much concern in both industry and government in the past few
years. Early research [1, 26] tends to address this problem with a simple classification
model built on a well-designed backbone, and some work [20] simulates the
generation process of deep forgery to better capture the artifacts of the fake video pipeline.
Beyond academia, Facebook held a real-world deepfake detection competition with a
one-million-dollar bounty, motivated by the danger that deepfakes pose to
social media, to encourage better deepfake detection methods.
Plenty of classification models were proposed and achieved results
beyond expectation. The winner of this competition [27] adopted the state-of-the-art
image classification backbone EfficientNet [28] as the main component of his
model, and a novel data augmentation strategy contributed a lot to his final
ranking. The runner-up of this competition later completed their method [34],
which treats deepfake detection as a fine-grained classification task and
explicitly refines attention maps with a regional independence loss. Merged with texture
features extracted at the front-end layers, their model achieves state-of-the-art
results on several datasets. An attention map prediction scheme is also considered
in [6], where the forged area is predicted in both learning-based and dictionary-learning
ways, and the binary classification and attention map regression tasks are
trained with a multi-task loss function.

The detection methods mentioned above all concentrate on the RGB
domain of deepfakes; other works explore fake clues
inherent in the frequency domain of deepfake images. The discrete cosine transform
(DCT) [2] is adopted in [25], where the frequency layout of the image is handled in both
global and local views. Combined with a learnable frequency-aware component,
non-aligned information can be reliably detected at the frame level. Frank et al. [9]
also leverage the DCT in detection and analyze which part makes synthetic
deepfake images detectable. Their results suggest that up-sampling blocks leave a unique
fingerprint, but those frequency clues are not robust to perturbation.

Besides, some other approaches inspect artifacts from a side view.
FakeSpotter [31] does not directly use the features extracted by the backbone
network, but regards neuron behaviors as the basis of discrimination,
aiming at more robust detection. To take the time factor
into account in video-level authentication, spotting biometric clues such as
eye blinking [19, 32] and head pose sequences [32] is the first and most
natural idea. DeepRhythm [24] exposes deepfakes by monitoring the
heartbeat rhythms associated with minuscule periodic changes of skin color due
to blood pumping through the face.


Video-level Detection Methods. Most video-level methods regard a video as a
set of independent frames and simply take the average confidence score of the frames
as the basis for judging the authenticity of the video. Those methods actually follow
the frame-level perspective and neglect the interconnection between successive
frames.

Güera et al. [13] adopt a natural way to leverage the advantages of both convolutional
neural networks (CNNs) and recurrent neural networks (RNNs), using a CNN
for per-frame feature extraction and an RNN for exploring temporal inconsistencies.
But inter-frame inconsistency modeling is not well considered in their
approach. Tariq et al. [29] consider the artifacts introduced by non-consecutive
frames and develop a convolutional LSTM-based residual network for
temporal feature learning. Basic features of the human body such as eye blinking and
head pose movement are utilized in [19] and [32] to distinguish real from fake.
In [23], the authors leverage the relationship between visual and audio patterns
extracted from the same video to determine whether it has been modified.
Video-level detection gathers more information and, in general, should deliver
better performance. Strangely, however, video-level evaluation results in terms of ACC
and AUC are often lower than those at the frame level. Zi et al. [37] propose
two models built by stacking ADD blocks. In their experiments, ADDNet-3D reports much
lower detection accuracy than ADDNet-2D, a gap of about 10 percentage points on a
challenging dataset. Ganiyusufoglu et al. [11] adopt state-of-the-art structures used in
action recognition and evaluate their performance in deepfake detection.


Benchmark Datasets. Several comprehensive deepfake datasets have been published
in recent years, which greatly promotes the performance of deepfake detection
methods. One of the most popular datasets is FaceForensics++ (FF++) [26]. It
contains two graphics-based approaches, namely Face2Face [30] and FaceSwap [8],
and two learning-based methods, Deepfakes and NeuralTextures [15]. Both
face swap and face reenactment are covered. Celeb-DF [21] is one of the most
challenging datasets for deepfake detection, with clear identity labels and pixel-level
annotation. Its authors carefully address
several problems of fake video generation, including color mismatch,
inaccurate face masks and temporal flickering. With more attention drawn to this
research topic, new and better annotated datasets with more specific purposes have been
proposed recently, such as WildDeepfake [37] for real-world challenges and
OpenForensics [17] for multiple-face scenarios.


3 Approach 


Given an input video of some human activity, our goal is to detect whether the
identity has been replaced or the facial expression of the character has been manipulated.
We propose the CSCL network shown in Fig. 1 to improve the robustness and generalization
ability of deepfake video detectors with the help of self-consistency, by
measuring the comprehensive spatial and temporal discrepancy within the image stream.
More precisely, our method exploits a comprehensive consistency
that tackles the substantial drawbacks of the deepfake video production pipeline:


– Intra-frame: Spatial consistency. Intra-frame consistency in deepfake videos
is mostly provided by blending algorithms such as Gaussian blur or Poisson fusion,
which have been proved to be distinguishable.

– Inter-frame: Temporal consistency. Common generative models with a
frame-by-frame swapping process cannot guarantee smooth temporal
momentum, while most former consistency learning methods focus only
on a single manipulated frame and tend to overfit to one subset of
manipulations.

– Comprehensive consistency coordination. Extra blur and filtering can
conceal the intra-frame discrepancy, and inter-frame inconsistency may also be
reduced by adaptive average blending. Unlike previous work, we utilize inter-
and intra-frame consistency coordination for more robust deepfake video
detection.
Fig. 1. Framework of CSCL network


3.1 Problem Formulation 


We first formulate the video-level deepfake detection task. The dataset
$D = \{X_i, L_i\}_{i=1}^{n}$ consists of $n$ pairs of a video clip and its label, fake or real,
denoted as $L_i \in \{0, 1\}$. A video clip can be seen as multiple consecutive frames
$X_v = \{x_t\}_{t=1}^{T_v}$, where $x_t \in \mathbb{R}^{C \times H \times W}$ is the $t$-th frame
of video $X_v$ and $T_v$ is the total number of frames. All frames of a video $X_v$ are deemed
manipulated if $X_v$ is labeled fake, and vice versa. The goal of deepfake detection is to learn a
model $\Phi$ that takes all consecutive frames of one video and gives a clear judgment
of its authenticity, formulated as $\Phi(X_v) \in \{\text{fake}, \text{real}\}$.


3.2 Design of Model 


Spatial Consistency of Contexts vs. Faces. Computing similarity scores
among image patches to expose inconsistency has already been proved effective in image
forensics research [33, 35, 36]. Without loss of generality, we first obtain the feature $f_t$
of image $x_t$ from the backbone model $G$, of size $H' \times W' \times C'$, where $H'$ and $W'$
are the patch numbers along columns and rows.

$$f_t = G(x_t), \quad f_t \in \mathbb{R}^{H' \times W' \times C'} \qquad (1)$$

For each frame $x_t$ we follow [35] to calculate the 4D consistency map $\widehat{SM}$
with:

$$\widehat{SM}_{h,w,h',w'} = d\left(f_t^{h,w}, f_t^{h',w'}\right) = 1 - \cos\left(f_t^{h,w}, f_t^{h',w'}\right) \qquad (2)$$

Each patch of a frame’s mask has only two possible states, manipulated or not.
A patch $P$ located in the face area is denoted as $P_F$ and a patch in the context as $P_C$,
with $\psi(P_F) = 1$ and $\psi(P_C) = 0$. The ground truth is

$$SM_{P_i,P_j} = \psi(P_i) \oplus \psi(P_j) \qquad (3)$$

and the spatial consistency loss is

$$L_{SC} = \left| \widehat{SM} - SM \right| \qquad (4)$$
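
The spatial consistency computation can be illustrated in plain Python on a tiny feature grid. Several choices here are assumptions rather than the authors' implementation: features are nested lists instead of tensors, the ground truth is taken as the exclusive-or of the face-membership indicators of two patches, and the loss is reduced by a mean over all entries of the map.

```python
import math


def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v); zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def spatial_consistency_map(f):
    """4D map over all patch pairs of one frame; f is an H' x W' grid of vectors."""
    H, W = len(f), len(f[0])
    return [[[[cosine_distance(f[h][w], f[h2][w2]) for w2 in range(W)]
              for h2 in range(H)] for w in range(W)] for h in range(H)]


def ground_truth_map(face_mask):
    """Target map: 1 where exactly one of the two patches lies in the face area."""
    H, W = len(face_mask), len(face_mask[0])
    return [[[[face_mask[h][w] ^ face_mask[h2][w2] for w2 in range(W)]
              for h2 in range(H)] for w in range(W)] for h in range(H)]


def spatial_consistency_loss(pred, target):
    """Mean absolute difference between predicted and target maps."""
    diffs = [abs(p - t)
             for pred_h, tgt_h in zip(pred, target)
             for pred_w, tgt_w in zip(pred_h, tgt_h)
             for pred_row, tgt_row in zip(pred_w, tgt_w)
             for p, t in zip(pred_row, tgt_row)]
    return sum(diffs) / len(diffs)
```

For a manipulated frame, a face patch and a context patch should then be far apart in feature space, while two context patches or two face patches should be close.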


Temporal Consistency of Consecutive Frames. To catch inconsistency
between successive frames, we further extend the attention to temporal consistency
learning. Having obtained the patch-based feature $f_t$ from $x_t$, we consider
the relation between $f_t$ and $f_{t-1}$. For each patch $P_{h,w}$ at timestamp $t$, we have
a 2D consistency map:

$$TM_t^{h,w} = d\left(f_t^{h,w}, f_{t-1}^{h,w}\right) = 1 - \cos\left(f_t^{h,w}, f_{t-1}^{h,w}\right) \qquad (5)$$
Considering the momentum between $t$ and $t-1$, we calculate the temporal consistency
loss $L_{TC}$ from the maps $TM_t^{h,w}$, aggregated over all patches $(h, w)$ and all
timestamps $t$ (Eq. 6).
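
The per-patch temporal map between consecutive frames can be sketched in plain Python as follows. The nested-list feature layout and the function names are assumptions for illustration; the sketch only computes the maps and does not attempt to reproduce the exact form of the temporal consistency loss.

```python
import math


def cosine_distance(u, v):
    """d(u, v) = 1 - cos(u, v); zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def temporal_consistency_maps(frames):
    """Return one 2D map per pair of consecutive frames.

    `frames` is a list over time of H' x W' grids of feature vectors; entry
    [h][w] of each map is the distance between patch (h, w) at t and at t - 1.
    """
    maps = []
    for prev, cur in zip(frames, frames[1:]):
        maps.append([[cosine_distance(cur[h][w], prev[h][w])
                      for w in range(len(cur[0]))]
                     for h in range(len(cur))])
    return maps
```

A clip of T frames yields T - 1 maps; patches whose features barely change between frames have values near 0, and abrupt changes push the value toward 1.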


Coordinating Temporal and Spatial Consistency. It is not hard to imagine
that, in both pristine and deepfake videos, people’s faces are moving most of
the time, either talking or making expressions. Otherwise there would be no need to
forge a static video that conveys no more information than a photo.
Measuring only the discrepancy of distance between consecutive faces and contexts would
therefore yield many false alarms. We thus propose a comprehensive consistency
coordination loss $L_{CCC}$ for adaptive learning, which monitors the relation between temporal
and spatial consistency. The final loss function is:

$$L = L_{real/fake} + \lambda L_{SC} + L_{TC} + (1 - \beta) L_{CCC} \qquad (7)$$


4 Experiment Results 


Implementation Details. We modify Xception [4] as the backbone, with
parameters initialized from Xception pre-trained on ImageNet. We train our
model using the Adam optimizer with an initial learning rate of 1e-4 and a weight decay of 1e-7.
The training epoch size is set to 2000 and the batch size to 32. If the validation loss does
not improve for 5 epochs, the learning rate is decayed by a factor of 0.3, so that the model can
converge after several learning rate decays.
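
The learning rate rule can be simulated in a few lines of plain Python. This is a sketch of one reading of the rule, not the training code: the function name is hypothetical, and "not getting better in 5 epochs" is interpreted as five consecutive epochs without a new best validation loss.

```python
def plateau_schedule(val_losses, lr=1e-4, factor=0.3, patience=5):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever the validation loss has failed
    to improve on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0  # start counting again after a decay
        history.append(lr)
    return history
```

With an initial rate of 1e-4, a first plateau lowers the rate to 3e-5 and a second one to 9e-6, which illustrates how a few decays quickly shrink the step size.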
4.1 In-Dataset Evaluation on FF++


FF++ is one of the most popular datasets for evaluating deepfake detection methods.
It contains 1000 real videos collected from the internet and 4000 fake videos
generated by four kinds of deepfake techniques. Moreover, FF++ provides videos in three
quality levels; we use the high quality (c23) and low quality (c40) versions in
this section. The raw quality videos are not considered because they are not very
common on the internet. We use the same split as [26]: both real and fake videos are
split into training, validation and test sets in a ratio of 72:14:14. Because the
number of real videos is much smaller than the number of fake videos, we
oversample real videos to balance the classes during training. At the test stage, one video
in FF++ may contain several clips, so we extract as many clips as possible from each
video (interval set to 16, no overlap) and take the average score of all clips as the
confidence score of the video. The test results are listed in Table 1.
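
The test-time aggregation can be sketched as follows, assuming that "interval is set to 16, no overlap" means clips of 16 consecutive frames taken with a stride of 16. The function names are hypothetical and the clip scores would come from the trained model.

```python
def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs of non-overlapping clips.

    Trailing frames that do not fill a whole clip are dropped.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]


def video_score(clip_scores):
    """Average the per-clip confidence scores into one video-level score."""
    if not clip_scores:
        raise ValueError("at least one clip score is required")
    return sum(clip_scores) / len(clip_scores)
```

A 50-frame video, for example, yields three clips covering frames 0 to 47, and the video-level confidence is the mean of the three clip scores.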


Table 1. In-dataset Performance (ACC %) on four types of deepfake in FF++. 
DF: DeepFakes, F2F: Face2Face, FS: FaceSwap, NT: NeuralTextures. The best result 
is shown in bold text, and the second-best is underlined. 


            | Methods               | DF     | F2F   | FS    | NT
Frame Level | LD-CNN [10]           | 75.00  | 56.00 | 51.00 | 62.00
            | Constrained Conv [5]  | 87.00  | 82.00 | 74.00 | 74.00
            | CustomPooling CNN [3] | 80.00  | 62.00 | 59.00 | 59.00
            | MesoNet [1]           | 90.00  | 83.00 | 83.00 | 83.00
            | Xception [4]          | 96.01  | 93.29 | 96.71 | 79.14
Video Level | PCL [35]              | 96.87  | 94.93 | 98.44 | 99.58
            | PD [33]               | 97.53  | 96.57 | 95.01 | 92.55
            | ours                  | 100.00 | 99.84 | 99.21 | 99.37


4.2 Cross-Dataset Evaluation on Celeb-DF 


Poor generalization is still a thorny problem in deepfake detection: even
state-of-the-art methods suffer drastic performance degradation when
tested on deepfakes generated by unseen techniques. Our method formulates
deepfake detection from a discrepancy-discovery perspective and achieves the best
cross-dataset performance, as listed in Table 2. The tested model is trained
on FF++ low quality, following the setting of [22] for fair comparison. Many
methods report close to 100% AUC on the training dataset but fail to transfer to
a different dataset. Our model achieves the best cross-dataset test performance
while keeping the best test result on the training dataset.
Table 2. Cross-dataset Performance (AUC %) on Celeb-DF. The best result is shown
in bold text, and the second-best is underlined.


Methods               | FF++  | Celeb-DF
MesoNet-Inception [1] | 83.00 | 53.60
FWA [20]              | 80.10 | 56.90
Xception-raw [4]      | 99.70 | 48.20
Xception-c23 [4]      | 99.70 | 65.30
Xception-c40 [4]      | 95.50 | 65.50
DSP-FWA [20]          | 93.00 | 64.60
Two-Branch [22]       | 93.18 | 73.41
PCL [35]              | 99.79 | 72.44
Patch-Diffusion [33]  | 99.85 | 74.27
ours                  | 99.85 | 77.73


4.3 Ablation Study 


This section analyzes the effectiveness of our proposed CSCL module. CSCL consists
of three parts: spatial consistency, temporal consistency and comprehensive
consistency coordination. To validate whether each part of comprehensive
consistency improves generalizability, we conduct an ablation study comparing our
method with the following variants. (1) Xception [4]: the baseline approach without
any consistency cue. (2) Xception w/ sc: we follow the setting of [35] with only the
spatial patch consistency loss. (3) Xception w/ tc: we use only the temporal
consistency loss on top of the baseline. (4) Ours: the full CSCL model with both
spatial and temporal consistency, plus the consistency coordination loss. Results
are listed in Table 3.


Table 3. Ablation Performance in FF++. The best result is shown in bold text.


          | Methods          | AUC(HQ) | AUC(LQ)
Base Line | PD [33]          | 99.85   | 94.43
          | PCL [35]         | 99.79   | 96.38
Ablation  | Xception+SC      | 98.46   | 98.01
          | Xception+TC      | 95.13   | 94.93
          | Xception+SC+TC   | 99.46   | 98.16
          | CSCL(SC+TC+CCC)  | 99.85   | 98.21


5 Summary 


In this paper, we address deepfake detection from the view of comprehensive
self-consistency learning. More specifically, we propose a CSCL model with
spatial-temporal consistency learning to explicitly formulate the inherent flaws
of intra- and inter-frame misalignment in deepfakes. To achieve more effective and
robust deepfake detection, we also propose the CCC loss, namely the comprehensive
consistency coordination loss, which tackles the inevitable artifacts within the
deepfake production pipeline. Extensive experiments demonstrate the superior
performance of our method in deepfake detection, especially in more realistic tests
such as the cross-dataset and low-quality settings.


Abstract. Traditional asset value assessment methods are subjective and cannot
distinguish the value of different assets carrying the same type of business. In
response, this paper proposes a comprehensive assessment method that takes into
account the importance of the business carried by the assets. Four factors affecting
business importance are selected as evaluation indicators, and the CRITIC objective
weighting method is used to obtain the weight of each evaluation indicator and
calculate the importance of the business carried by the asset. The asset value is
then calculated using the multiplication method together with the assigned values
of the asset in terms of confidentiality (C), integrity (I) and availability (A).
Case validation shows that asset values assessed by combining business importance
are consistent with the actual value of the assets, and a comparison with the
traditional method shows that the proposed method is more objective and reasonable
in assessing asset value.


Keywords: Asset value assessment · Business importance · CRITIC objective
weighting method · Multiplication method


1 Introduction 


As government departments, financial institutions, enterprises, public institutions
and commercial organizations increasingly rely on information systems, information
security issues have received widespread attention. Using risk assessment to analyze
the security risks in information systems and propose targeted corrective measures
is an effective means of solving information security problems. Identifying assets
and assessing their value is the primary task of information security risk
assessment, and asset value is currently calculated mainly on the basis of
confidentiality (C), integrity (I) and availability (A) [1]. Tang [2] proposes an
objective weighting method that assigns weights to evaluation indicators so that the
calculated importance values are more objective, but this weighting method
considers only the dispersion of the data and not the correlation between
indicators. The literature [3] observes that quantifying the security level of
assets in terms of confidentiality, integrity and availability is prone to
subjectivity, and therefore introduces the importance of the business carried by the
assets as a factor to reduce the subjective influence, synthesizing the value of the
assets by weighting and other methods; however, it gives no specific implementation
algorithm. Xiang Hong [4] identifies assets based on business and uses the AHP
method to assign values to assets. The AHP method is effective in reducing the
drawbacks of a completely subjective assessment, but it requires multiple
experienced experts to give a reliable judgment matrix and also involves
calculations such as consistency tests, which increases the complexity of the
calculation. Zhou Jing-Xian [5] uses the rough set approach to calculate the value
of each asset by assigning weights to the four factors that determine it, namely C,
I, A and the importance of the business undertaken. Since the importance of the
business is judged by humans, different decision makers may assign different values
to the same asset. To solve the above problems, the asset value needs to be
calculated by combining it with a quantified importance of the business carried by
the asset.

In this paper, we propose a method that evaluates the asset value using four
factors: the importance of the business carried by the asset together with
confidentiality, integrity and availability. The method selects evaluation
indicators that reflect the importance of the business carried by the asset,
calculates the business importance value with the CRITIC weighting method, and then
applies the multiplication method to the four influencing factors to obtain the
asset value. This reduces the subjective influence of the traditional method, which
considers only CIA, and distinguishes the value of assets that belong to different
organizations but carry the same type of business, as well as assets that carry
different types of business under the same organization.


2 Asset Valuation Method 


The value of an asset is determined by the level of assignment of the three security 
attributes of confidentiality, integrity and availability, as well as the importance of the 
business undertaken by the asset. The realization of a complete business requires the 
involvement of multiple assets, and the more significant the business is, the more impor- 
tant its associated assets are. Based on this, this paper proposes an intuitive asset valuation 
model to analyze the value of assets. 


2.1 Asset Valuation Model 


The asset value assessment model proposed in this paper is depicted in Fig. 1. For
information assets, value is mainly reflected in four indicators: confidentiality,
integrity, availability and business importance. The higher the requirements for
these indicators, the higher the asset value. The three security attributes of
confidentiality, integrity and availability are classified into five levels: very
high, high, medium, low and very low. The higher the level, the higher the
requirement of the asset for this security attribute. The importance of the business
carried by the assets is mainly reflected in the business itself and the impact on
the organization of the assets attached to the business, so the importance of the
business can be evaluated from four aspects: organization ranking, organization
level, scope of impact and business category.


Fig. 1. Asset valuation model


2.2 Business Importance Indicators System Construction 


The evaluation model shows that the asset value is affected by the importance of
the business. To evaluate the asset value more accurately, indicators that rely
entirely on objective data must be selected to quantify the importance of the
business. Starting from the business itself, this paper proposes the business
category and influence range indicators: the more core the business category, the
higher the importance of the business, and the wider the range affected when the
business cannot operate normally, the higher the importance of the business.
However, these two indicators do not fully reflect the importance of the business,
and they cannot distinguish the importance of different businesses that belong to
the same business category and have the same scope of influence. This paper
therefore starts from the assets on which the business depends and proposes two
further indicators, organization ranking and organization level, which reflect the
importance of the business by measuring the importance of the assets it runs on.
Organization ranking refers to the ranking within the industry of the organization
to which the asset belongs. The higher the ranking, the stronger the organization is
in the industry and the higher the importance of its subordinate assets. The
organization level refers to the category into which the organization to which the
asset belongs is classified in that industry. The higher the category level, the
more important the organization and the more important its subordinate assets. (If
the value of an indicator of the assessment object cannot be determined, we may
assign the same default value to the indicator, and it is necessary to ensure that
the final sum of all indicator weights is 1.)

The values of the organization ranking, organization level and influence range
indicators are obtained by analyzing the reports issued by the organization to which
the actual assets belong. The business category indicator is determined by first
taking the classification of businesses from the literature [6] and then combining
it with the business carried on the actual assets to identify the specific category.
The literature roughly classifies businesses into five major categories according to
their characteristics, as shown in Table 1 (since specific business systems are not
mentioned, the information in the table is not complete, and the classification of
businesses in actual assessment work should be based on the actual situation).
Because the value of an asset takes into account the importance of the business it
carries, the exact same information asset is likely to have a different value to the
organization being evaluated depending on the business it carries.


Table 1. Example of business classification. 


Business category                      | Business data characteristics                                 | Assignment
Production application business class | Directly linked to core business operations                   | 5
Financial marketing business class     | Handling confidential internal operations and classified data | 4
Management information business class  | Office automation and other management information services   | 3
Open business class                    | Direct to external users                                      | 2
Other businesses                       | Ensure the normal operation of the basic system               | 1


2.3 Business Importance Calculation 


Because the indicators contribute differently to the importance of the business,
this paper uses the CRITIC method [7] to assign weights to the indicators and
calculate the importance of the business. As the previous subsection shows, business
importance is determined by four indicators: organization ranking, organization
level, influence range and business category. The CRITIC method calculates the
weight of each indicator from two characteristics: the conflict between indicators
and the variation of the values taken by the evaluation objects under each
indicator [8]. For example, the greater the conflict between the organization
ranking indicator and the other indicators, and the greater the variation of the
data under that indicator, the more information the indicator contains, that is, the
greater its weight and its contribution to the importance of the business. The
weights of the other indicators are obtained from the CRITIC method in the same
way [9]. The calculation steps are as follows. (If only one indicator is selected to
assess the importance of the business, only the normalization step of this algorithm
needs to be applied to the indicator data.)


1) In order to eliminate the influence of different magnitudes on the evaluation
results, formula (1) applies reverse processing to indicators for which a smaller
value is better, and formula (2) applies forward processing to indicators for
which a larger value is better [10]:

$$x^*_{ij} = \frac{\max(x_j) - x_{ij}}{\max(x_j) - \min(x_j)} \tag{1}$$

$$x^*_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)} \tag{2}$$

In the formulas, $\max(x_j)$ is the maximum value of indicator $j$, $\min(x_j)$ is
the minimum value of indicator $j$, $x_{ij}$ is the value of evaluation object $i$
under indicator $j$, and $x^*_{ij}$ is the processed value, whose range is [0, 1].
2) After the data are processed, the standard deviation of each indicator is
calculated using formula (3) as a measure of the variation in the values taken by
the assessment objects under each indicator:

$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{n} \left(x^*_{ij} - \bar{x}_j\right)^2}{n - 1}} \tag{3}$$

Here $\sigma_j$ is the standard deviation of indicator $j$ and $\bar{x}_j$ is the
average of the $n$ assessment objects under indicator $j$.
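3) Formulas (4) and (5) are used to calculate the magnitude of conflict between
indicators:

$$r_{ij} = \frac{\sum (x_i - \bar{x}_i)(x_j - \bar{x}_j)}{\sqrt{\sum (x_i - \bar{x}_i)^2 \sum (x_j - \bar{x}_j)^2}} \tag{4}$$

$$A_j = \sum_{i=1}^{M} (1 - r_{ij}) \tag{5}$$

In the formulas, $r_{ij}$ denotes the correlation coefficient between indicator $i$
and indicator $j$, $x_i$ and $x_j$ denote all data under indicator $i$ and indicator
$j$, respectively (the sums in formula (4) run over the assessment objects), $M$ is
the number of indicators, and $A_j$ denotes the conflict between indicator $j$ and
the other indicators.

4) From formula (6), the weights of the indicators, $w_1$, $w_2$, $w_3$ and $w_4$,
are obtained:

$$w_j = \frac{\sigma_j A_j}{\sum_{k=1}^{M} \sigma_k A_k} \tag{6}$$

5) According to the weight of each indicator and the value of each business object
under each indicator, the business importance $\alpha$ can be obtained:

$$\alpha_i = \sum_{j=1}^{M} w_j x^*_{ij} \tag{7}$$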
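
The five steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the standard deviation uses the sample form (dividing by n - 1), which is an assumption that appears to reproduce the values reported in Table 5, and the toy data at the bottom are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def normalize(column: Sequence[float], larger_is_better: bool) -> List[float]:
    """Step 1: min-max scale one indicator to [0, 1] (formulas (1) and (2))."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        raise ValueError("indicator is constant and cannot be normalized")
    if larger_is_better:
        return [(x - lo) / span for x in column]   # forward processing
    return [(hi - x) / span for x in column]       # reverse processing


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Step 3: correlation coefficient between two indicators (formula (4))."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u)
    sv = sum((b - mv) ** 2 for b in v)
    return cov / sqrt(su * sv)


def critic_weights(columns: Sequence[Sequence[float]]) -> List[float]:
    """Steps 2 to 4: CRITIC weights from already normalized indicator columns."""
    # Formula (3); sample standard deviation is an assumption of this sketch.
    sigma = [stdev(col) for col in columns]
    # Formula (5); the self-correlation term contributes (1 - 1) = 0.
    conflict = [sum(1 - pearson(ci, cj) for ci in columns) for cj in columns]
    information = [s * a for s, a in zip(sigma, conflict)]
    total = sum(information)
    return [c / total for c in information]        # formula (6)


def business_importance(columns: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Step 5: weighted sum of normalized values per object (formula (7))."""
    return [sum(w * x for w, x in zip(weights, row)) for row in zip(*columns)]


# Hypothetical toy data: four objects, two indicators.
toy_rank = [1, 2, 3, 4]     # smaller is better, so reverse processing
toy_level = [5, 3, 4, 1]    # larger is better, so forward processing
toy_cols = [normalize(toy_rank, False), normalize(toy_level, True)]
toy_weights = critic_weights(toy_cols)
toy_alpha = business_importance(toy_cols, toy_weights)
```

Keeping normalization separate from the weighting makes it easy to apply only the first step when a single indicator is used, as the text notes.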


2.4 Asset Value Calculation 


After obtaining the asset’s assigned levels of confidentiality, integrity and
availability and the importance of the business it carries, we use the
multiplication method to calculate the asset’s value. The specific calculation is as
follows.

Let the value of asset $j$ be $d_j$, its values in confidentiality, integrity and
availability be $c$, $i$ and $a$, and its business importance be $\alpha$. The asset
value is calculated as:

$$d_j = \sqrt[3]{\alpha \times c \times i \times a} \tag{8}$$
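
As a small sketch of the multiplication method, the function below takes the cube root of the product of the four factors. The cube-root form is an assumption of this sketch, chosen because it reproduces the worked value 4.997 given later for asset S3; the function name, the numeric coding of the five levels as 1 to 5 and the range checks are hypothetical additions.

```python
def asset_value(alpha: float, c: int, i: int, a: int) -> float:
    """Asset value from business importance and the C, I, A levels.

    alpha is the business importance in [0, 1]; c, i and a are the
    confidentiality, integrity and availability levels, assumed here to
    be coded 1 (very low) to 5 (very high).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("business importance must lie in [0, 1]")
    for level in (c, i, a):
        if level not in (1, 2, 3, 4, 5):
            raise ValueError("C, I and A levels must be integers from 1 to 5")
    # Cube root of the product of the four factors.
    return (alpha * c * i * a) ** (1.0 / 3.0)


# Asset S3: importance 0.9983 with C = I = A = 5.
d3 = asset_value(0.9983, 5, 5, 5)
print(round(d3, 3))  # 4.997
```

Because the importance lies in [0, 1], the result never exceeds the value obtained from C, I and A alone, so business importance acts as a discount on assets whose business is less critical.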


3 Evaluation Examples and Results Analysis


This paper uses the bank assets which are obtained from the extranet as the valuation 
object, and applies the above calculation method to calculate the value of each asset, and 
compares and analyzes the results obtained with the traditional method. 


3.1 Instance Data 


To determine the values of the indicators selected for calculating business
importance, we analyze them here in combination with the actual assets. From the
literature [11], we know the ranking of the organizations to which the bank extranet
assets belong, and from the literature [12], we can classify the organizations of
bank information system assets into five major levels: large state-owned banks,
state-owned commercial banks, regional city commercial banks, rural banks in each
county and district, and private banks. For a bank information system, the impact
range of the business on it can be reflected by the impact range of the organization
to which the asset belongs, so the impact range is divided into five categories:
global, national, province/municipality/autonomous region, city, and county, with
values assigned in descending order of range. Based on the actual evaluation objects
and the literature [13], the services carried by the bank’s extranet assets are
classified into five major categories: transaction services, customer exchange
services, online investment services, information services and other services.
Transaction services include money transfers and credit operations performed by
individuals or companies; they are the highest level of banking service system and
necessarily have access to the bank’s internal network. Customer exchange services
cover the communication of information, documents or files between the customer and
the bank [14, 15]; they are a higher-level service system and have access to the
bank’s internal network. Online investment services [16] allow customers to purchase
the various financial products launched by the bank. Information services publish
information that can be accessed by everyone; they are the most basic type of
business and have no access to the bank’s internal network. Other services include
various forms of special value-added services, such as payment services for daily
life. The business categories are assigned values according to the degree of
connection to the bank’s internal network and the level of the service system, see
Table 2. Through the above analysis, the 18 acquired bank extranet assets are
organized as shown in Table 3. (The 18 selected assets S1-S18 are the ICBC About ICBC
system, China Construction Bank deposit and loan and bank card system, China
Construction Bank investment and finance system, Agricultural Bank of China personal
service system, Agricultural Bank of China talent recruitment system, Bank of China
personal financial system, Bank of China electronic banking system, Bank of JIANGSU
personal business system, Chongqing Rural Commercial Bank savings business system,
Chengdu Rural Commercial Bank’s personal financial service system, Bank of
Chongqing’s personal business system, TRC Bank’s savings business system, Bank of
Dongguan’s personal business system, NRC Bank’s savings business system, Bank of
Tangshan’s email system, XIAOSHAN Rural Commercial Bank savings business system, and
ZJB’s savings business system.)


Table 2. Assignment of business importance indicators.


Level assignment | Bank level                    | Scope of influence                      | Business category
5                | Large state-owned enterprises | Global                                  | Trading business
4                | State-owned commercial banks  | National                                | Customer communication services
3                | City commercial banks         | Province/Municipality/Autonomous region | Online investment services
2                | Rural banks                   | City                                    | Information services
1                | Private banks                 | County                                  | Other services


Table 3. Asset information form.


Asset number | Organization ranking | Organization category         | Influence scope | Business type
S1           | 1                    | Large state-owned enterprises | Global          | Trading business
S2           | 1                    | Large state-owned enterprises | Global          | Information services
S3           | 2                    | Large state-owned enterprises | Global          | Trading business
S4           | 2                    | Large state-owned enterprises | Global          | Online investment services
S5           | 3                    | Large state-owned enterprises | Global          | Trading business
S6           | 3                    | Large state-owned enterprises | Global          | Information services
S7           | 4                    | Large state-owned enterprises | Global          | Trading business
S8           | 4                    | Large state-owned enterprises | Global          | Customer communication services
S9           | 18                   | City commercial banks         | Province        | Trading business
S10          | 22                   | Rural banks                   | County          | Trading business
S11          | 36                   | Rural banks                   | County          | Trading business
S12          | 44                   | City commercial banks         | Municipality    | Trading business
S13          | 53                   | Rural banks                   | County          | Trading business
S14          | 62                   | City commercial banks         | City            | Trading business
S15          | 70                   | Rural banks                   | County          | Trading business
S16          | 88                   | City commercial banks         | City            | Other services
S17          | 95                   | Rural banks                   | County          | Trading business
S18          | 100                  | Rural banks                   | County          | Trading business


3.2 Business Importance Calculation 


We use organization ranking (Index 1), organization level (Index 2), scope of influence
(Index 3), and business type (Index 4) as four indicators to assess the importance of the
business carried on the bank’s assets. The four indicators are quantified according to
Table 2, and the results are shown in Table 4.


Table 4. Quantification of business importance indicators. 


Asset number | Index 1 | Index 2 | Index 3 | Index 4
S1           | 1       | 5       | 5       | 5
S2           | 1       | 5       | 5       | 2
S3           | 2       | 5       | 5       | 5
S4           | 2       | 5       | 5       | 3
S5           | 3       | 5       | 5       | 5
S6           | 3       | 5       | 5       | 2
S7           | 4       | 5       | 5       | 5
S8           | 4       | 5       | 5       | 4
S9           | 18      | 3       | 3       | 5
S10          | 22      | 2       | 1       | 5
S11          | 36      | 2       | 1       | 5
S12          | 44      | 3       | 3       | 5
S13          | 53      | 2       | 1       | 5
S14          | 62      | 3       | 2       | 5
S15          | 70      | 2       | 1       | 5
S16          | 88      | 3       | 2       | 1
S17          | 95      | 2       | 1       | 5
S18          | 100     | 2       | 1       | 5


170 X. Yang et al. 


We use formula (1) and formula (2) to reverse or forward process the values of the
assets under the above four indicators. A smaller value is better under the
organization ranking indicator, while a larger value is better under the three
indicators of organization level, influence range and business category. From
Table 4, asset S3 takes the value $x_{31} = 2$ under the organization ranking
indicator, and the data under this indicator have a maximum value of 100 and a
minimum value of 1. Substituting into formula (1), the value after reverse
processing is $x^*_{31} = \frac{100 - 2}{100 - 1} = 0.9899$. The value $x_{32}$ of
asset S3 under the organization level indicator is 5, and the data under this
indicator have a maximum value of 5 and a minimum value of 2. Substituting into
formula (2), we get $x^*_{32} = \frac{5 - 2}{5 - 2} = 1$. Similarly, the values
$x_{33}$ and $x_{34}$ of S3 under the influence range and business category
indicators become $x^*_{33} = 1$ and $x^*_{34} = 1$ after processing by formula (2).
The values of the other assets under the four indicators are processed in the same
way.

After the above processing, the mean values of the four indicators are 0.669,
0.519, 0.528 and 0.819, in order. Substituting the 18 values under the organization
ranking indicator and the mean value of that indicator into formula (3), we find
that its standard deviation is 0.362. The standard deviations of the other
indicators are found in the same way, and the results are shown in Table 5. From
formula (4) and formula (5), the magnitudes of conflict between each indicator and
the other indicators are
as 1.457, 1.558, 1.486, and 3.735 in order.
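
The means and standard deviations quoted above can be checked against the Table 4 data with a few lines of Python. This is a sketch under two assumptions: the helper names are hypothetical, and the standard deviation is the sample form computed by `statistics.stdev`, which appears to match the reported figures.

```python
from statistics import mean, stdev

# Table 4 columns for assets S1..S18.
ranking = [1, 1, 2, 2, 3, 3, 4, 4, 18, 22, 36, 44, 53, 62, 70, 88, 95, 100]
level = [5, 5, 5, 5, 5, 5, 5, 5, 3, 2, 2, 3, 2, 3, 2, 3, 2, 2]
scope = [5, 5, 5, 5, 5, 5, 5, 5, 3, 1, 1, 3, 1, 2, 1, 2, 1, 1]
btype = [5, 2, 5, 3, 5, 2, 5, 4, 5, 5, 5, 5, 5, 5, 5, 1, 5, 5]


def reverse(col):
    """Reverse processing: smaller raw values map to larger results."""
    lo, hi = min(col), max(col)
    return [(hi - x) / (hi - lo) for x in col]


def forward(col):
    """Forward processing: larger raw values map to larger results."""
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) for x in col]


# Only the ranking indicator is "smaller is better".
columns = [reverse(ranking), forward(level), forward(scope), forward(btype)]

means = [mean(c) for c in columns]
sigmas = [stdev(c) for c in columns]   # sample standard deviation (n - 1)

for name, m, s in zip(["Index1", "Index2", "Index3", "Index4"], means, sigmas):
    print(f"{name}: mean={m:.3f} std={s:.3f}")
```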

From formula (6), we can obtain the weight $w_1$ of the organization ranking
indicator:

$$w_1 = \frac{0.362 \times 1.457}{0.362 \times 1.457 + 0.460 \times 1.558 + 0.461 \times 1.486 + 0.330 \times 3.735} = 0.167$$

Similarly, the weight $w_2$ of the organization level indicator is found to be 0.227,
the weight $w_3$ of the influence range indicator is 0.217, and the weight $w_4$ of
the business category indicator is 0.389.


Table 5. CRITIC method to calculate the weighting process. 


Index  | Standard deviation | Conflict with other indicators | Weights
Index1 | 0.362              | 1.457                          | 0.167
Index2 | 0.460              | 1.558                          | 0.227
Index3 | 0.461              | 1.486                          | 0.217
Index4 | 0.330              | 3.735                          | 0.389


From Table 4, the values of asset S3 under the four indicators are 2, 5, 5, 5, which
after processing become 0.9899, 1, 1, 1, and the corresponding weights of the
indicators are (0.167, 0.227, 0.217, 0.389). The importance $\alpha_3$ of the
business on asset S3 is obtained from formula (7) as:

$$\alpha_3 = w_1 \times 0.9899 + w_2 \times 1 + w_3 \times 1 + w_4 \times 1 = 0.9983$$

Similarly, we can get the importance of the business carried by the other assets, see
Table 6.
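
The arithmetic behind the weights and the importance of asset S3 can be retraced with the short sketch below. It starts from the rounded standard deviations and conflict values printed in Table 5, so the recomputed weights agree with the reported ones only to within about 0.001; the variable names are hypothetical.

```python
# Standard deviations and conflict values as printed in Table 5.
sigma = [0.362, 0.460, 0.461, 0.330]
conflict = [1.457, 1.558, 1.486, 3.735]

# Formula (6): each weight is the indicator's share of sigma_j * A_j.
information = [s * a for s, a in zip(sigma, conflict)]
weights = [c / sum(information) for c in information]

# Formula (7) for asset S3, using the weights as reported in the text.
reported_weights = [0.167, 0.227, 0.217, 0.389]
s3_processed = [0.9899, 1.0, 1.0, 1.0]
alpha_s3 = sum(w * x for w, x in zip(reported_weights, s3_processed))

print([round(w, 3) for w in weights])
print(round(alpha_s3, 4))  # 0.9983
```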


3.3 Value Assessment 


Based on the method described in Sect. 2, the three major security attributes of the
bank’s extranet assets are assigned, see Table 6. The confidentiality of an asset is
mainly analyzed and evaluated by the degree of disclosure of the asset; for example,
the confidentiality of deposit-related data within a bank is the highest, and once
disclosed, it will have a very serious impact on the normal operation of the bank.
Integrity is analyzed from the damage to the entire organization if the integrity of
the asset is breached. Availability is measured in terms of the damage caused to the
organization by interruption of the asset’s functions. Assets that carry transaction
services must have a connection channel to the bank’s internal network, so their
confidentiality, integrity and availability are of the highest level. Assets that
run customer exchange services generally have connection channels established with
the bank’s internal network. Assets that carry online investment services have a
certain connection to the bank’s internal network. Assets carrying information
services or other services have no connection channel to the bank’s internal
network, so their C, I and A take lower values than the previous ones. For assets
that carry other services, the CIA takes the lowest value of all.


Table 6. Asset value indicators. 


Asset number | C | I | A | Business importance
S1           | 5 | 5 | 5 | 1
S2           | 3 | 3 | 3 | 0.7083
S3           | 5 | 5 | 5 | 0.9983
S4           | 4 | 5 | 4 | 0.8038
S5           | 5 | 5 | 5 | 0.9966
S6           | 3 | 3 | 4 | 0.7049
S7           | 5 | 5 | 5 | 0.9949
S8           | 5 | 4 | 5 | 0.8977
S9           | 5 | 5 | 5 | 0.7115
S10          | 5 | 5 | 5 | 0.5206
S11          | 5 | 5 | 5 | 0.4970
S12          | 5 | 5 | 5 | 0.6676
S13          | 5 | 5 | 5 | 0.4683
S14          | 5 | 5 | 5 | 0.5830
S15          | 5 | 5 | 5 | 0.4396
S16          | 2 | 2 | 2 | 0.1502
S17          | 5 | 5 | 5 | 0.3974
S18          | 5 | 5 | 5 | 0.3890


From Table 6, the values of C, I, and A of asset S3 and the importance of the business
it carries are 5, 5, 5, and 0.9983, respectively. Substituting into formula (8), the
value $d_3$ of asset S3 is obtained as:

$$d_3 = \sqrt[3]{0.9983 \times 5 \times 5 \times 5} = 4.997$$


The remaining assets are evaluated by the same method and the results are written 
in Table 7. 




Table 7. Comparison of asset value results. 


Asset number | Traditional method | Methodology of this article
S1           | 5                  | 5
S2           | 3                  | 2.674
S3           | 5                  | 4.997
S4           | 4                  | 4.006
S5           | 5                  | 4.994
S6           | 3                  | 2.670
S7           | 5                  | 4.991
S8           | 4                  | 4.423
S9           | 5                  | 4.464
S10          | 5                  | 4.022
S11          | 5                  | 3.961
S12          | 5                  | 4.370
S13          | 5                  | 3.883
S14          | 5                  | 4.177
S15          | 5                  | 3.802
S16          | 2                  | 1.063
S17          | 5                  | 3.676
S18          | 5                  | 3.650


3.4 Results Analysis and Comparison 


For bank extranet assets, the comparison between the evaluation results obtained by
this paper’s method and the traditional method, which considers only the three major
security elements, is shown in Table 7. The ranking of asset values obtained by the
traditional method [17], from highest to lowest, is (S1, S3, S5, S7, S9, S10, S11,
S12, S13, S14, S15, S17, S18, S8, S4, S2, S6, S16), where assets S1, S3, S5, S7, S9 to
S15, S17 and S18 all obtain an asset value of 5, assets S8 and S4 obtain 4, and
assets S2 and S6 obtain 3. The asset values obtained by the method in this paper,
ranked from highest to lowest, are (S1, S3, S5, S7, S9, S8, S12, S14, S10, S4, S11,
S13, S15, S17, S18, S2, S6, S16), and the value of each asset is different. Many
asset values calculated by the traditional method are identical, making it
impossible to distinguish the value of assets that carry different business types
in the same organization and assets that carry the same business type in different
organizations, while the value of these assets can be clearly distinguished by the
results calculated with the business importance factor proposed in this paper. For
assets S3 and S4, which belong to the same organization but carry different types of
business, the values calculated by the traditional method are 5 and 4, and the
results obtained by the method in this paper are 4.997 and 4.006. Both methods
obtain a higher value for asset S3 than for asset S4, which indicates that the
proposed method is correct and feasible. For assets S9 and S12, which are both
personal business systems, the values of C, I and A are the same, so the traditional
method gives both a value of 5 and cannot tell which is more valuable. The method of
this paper gives 4.464 and 4.370, because although the CIA values of the two assets
are the same, Table 6 shows that the importance of the business carried on S9 is
higher than that on S12. It can therefore be concluded that the value of asset S9
is higher than that of asset S12, which indicates that for assets that carry the
same type of business in different organizations, this method distinguishes their
values better than the traditional method.

The above examples show that, in addition to the factors of confidentiality,
integrity and availability that affect the value of assets, it is necessary to
consider the importance of the business carried by the assets. Doing so not only
reduces the influence of subjective factors, but also solves the problem that the
value of different assets carrying the same type of business cannot be
distinguished by traditional methods. Therefore, it is practical and realistic to
use this paper’s method to assess the value of assets.


4 Conclusions 


Objectivity, accuracy and ease of differentiation are the goals that information
asset value assessment must achieve. In this paper, in addition to the three
security attributes of confidentiality, integrity and availability of assets, we
propose that the value of assets is also influenced by the importance of the
business they carry, and use the multiplication method to calculate the value of
assets. We use the CRITIC weighting method to assign weights to four objective
evaluation indicators that measure business importance: organization ranking,
organization level, scope of influence and business category, and then calculate
business importance from the obtained weights and the data after forward or reverse
processing. The feasibility of the proposed method is verified by evaluating the
value of bank assets obtained from the extranet. The method can also be applied to
other organizations to calculate the value of assets and prepare for subsequent
risk assessment work.


Abstract. China has a huge population of Internet users, and many enterprises and
institutions are deeply affected by the threat of cybersecurity vulnerabilities. At
present, depending on the needs of different business scenarios, business personnel
often have to search for different vulnerability information separately and
manually. Moreover, the vulnerability intelligence distributed on the Internet is
multi-source and heterogeneous, which makes it difficult to ensure the
effectiveness and reliability of vulnerability knowledge. Against this background,
and with vulnerabilities as the core, we carry out knowledge extraction from
vulnerability intelligence according to existing standards, establish the
corresponding entities and relationships, and study and construct an associated,
visualized knowledge graph to support information workers in the discovery and
tracing of vulnerability threats.


Keywords: Knowledge graph - Cybersecurity - Vulnerability - Named entity 
recognition - Relationship extraction 


1 Introduction 


With the rapid development of the Internet industry, a large number of Cybersecurity
vulnerabilities have been discovered and exploited in the products of various companies,
posing risks to production and daily life. Vulnerability threat discovery and
traceability have become common challenges and work requirements for personnel such as
system operation and maintenance staff and network administrators. Vulnerability
information comes from many sources, including vulnerability reports from open source
communities, public vulnerability databases, and product patch information. The data is
scattered, incomplete, and differently structured, and the vulnerability knowledge from
sources such as different Internet community platforms varies widely. High-quality and
low-quality information is mixed together, duplication is high, correlations are
unclear, and data quality cannot be guaranteed, so the data cannot effectively support
the needs of Cybersecurity business personnel for vulnerability detection, analysis,
and judgment.
In recent years, knowledge graphs have been combined with deep learning to form valuable
information and knowledge models through data collection, analysis, and mining. After
the knowledge graph concept was proposed by Google and applied to intelligent search
[1], it was first applied effectively in the commercial field, for example the LinkedIn
economic graph (User Profile) in the social field and the Tianyancha enterprise graph
(Enterprise Profile) in the field of enterprise information.

In various vertical fields in China, there has been research and exploration on the
application of knowledge graphs. An Ning et al. [2] proposed the construction of a
cross-platform network public opinion knowledge graph, using Sina Weibo and Douyin
short videos as data sources, which is mainly used in the management and guidance of
network public opinion. Xiao Le et al. [3] proposed a knowledge graph for grain
situation, which extracts grain situation entities mainly with a grain situation
dictionary and the Flat-lattice model and is used to assist grain situation
decision-making. Mou Tianhao et al. [4] proposed a knowledge graph of process industrial
control systems, based on control system cyber-physical asset management tasks, to
solve business problems related to industrial control systems. Zhang Kunli et al. [5]
took obstetric diseases as the core and proposed a Chinese obstetric knowledge graph to
facilitate medical question answering and auxiliary diagnosis and treatment.

There are few applications of knowledge graphs in the field of Cybersecurity. This
paper uses a knowledge graph to correlate numerous isolated pieces of vulnerability
intelligence and present a panorama of vulnerability entities, which provides a new
approach to vulnerability research and analysis and helps to address difficulties in
Cybersecurity business.


2 Vulnerability Knowledge Graph Construction Route 


Large-scale domestic vulnerability databases include the China National Vulnerability
Database (CNVD) and the China National Vulnerability Database of Information Security
(CNNVD), which are the main channels for building and sharing vulnerability
intelligence [6]. Given the current state of information security development, the
sources of vulnerability intelligence in this paper are CNVD, CNNVD, and CVE (Common
Vulnerabilities and Exposures). After the vulnerability knowledge is integrated, it is
proofread manually, and data with low confidence is discarded to ensure the quality of
the vulnerability knowledge base. At the same time, the knowledge extraction model
continues to be trained, under supervision, on new intelligence. As data accumulates,
new knowledge base data sources such as open source security websites are added as
appropriate, and the entire system is updated iteratively.
2.1 Schema Layer Design


The schema layer of the vulnerability knowledge graph sits above the data layer. Its
core is the ontology library, an abstract representation of vulnerability knowledge,
comparable to a “class” in object-oriented programming. The schema layer mainly
consists of two kinds of triples: entity-relation-entity and entity-attribute-attribute
value. According to “Information security technology—Cybersecurity vulnerability
identification and description specification” [8] (GB/T 28458-2020), the framework of
vulnerability identification and description is composed of identification items and
description items. Taking into account the actual situation of domestic
vulnerabilities, mainly from the perspective of vulnerability management and emergency
response [9], the main attribute of the vulnerability is CNVD_ID. The preliminary
design of the entity and relationship framework is shown in Fig. 1.

[Figure: entity boxes with their attributes, e.g. event (description, time, URL, victim)
and vulnerability (product, description, solution, patch, CVE_ID), linked by
relationship arrows]

Fig. 1. The framework of entity and relationship

Based on the graph structure, entities represent objects or abstract concepts in the
vulnerability space, and relationships model the interactions between entities. The
framework follows the triple form (head entity, relation, tail entity). In Fig. 1,
entities are drawn as boxes, each row under the entity name is one of its attributes,
PK marks the main attribute, and arrows represent relationships. Five entities are
defined: vulnerability = {CNVD_ID, title, date, level, product, description, solution,
patch, CVE_ID}; event = {event_id, description, time, URL, victim}; company = {name};
product = {name}; victim = {name}. Four relationships are defined: influence, raise,
belong to, and use. More entities, attributes, and relationships can be added gradually
within this framework.
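
As a rough illustration of the schema above, the five entity types and four relationship types can be written down as plain Python data classes. This is only a sketch: the class names, field types, the `make_triple` helper, and the example identifier are assumptions made for illustration and are not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Tuple

# Relation names follow the schema layer; everything else is illustrative.
RELATIONS = ("influence", "raise", "belong to", "use")


@dataclass(frozen=True)
class Vulnerability:
    cnvd_id: str  # main attribute (PK)
    title: str = ""
    date: str = ""
    level: str = ""
    product: str = ""
    description: str = ""
    solution: str = ""
    patch: str = ""
    cve_id: str = ""


@dataclass(frozen=True)
class Event:
    event_id: str
    description: str = ""
    time: str = ""
    url: str = ""
    victim: str = ""


@dataclass(frozen=True)
class Company:
    name: str


@dataclass(frozen=True)
class Product:
    name: str


@dataclass(frozen=True)
class Victim:
    name: str


def make_triple(head, relation: str, tail) -> Tuple[object, str, object]:
    """Build a (head entity, relation, tail entity) triple, rejecting unknown relations."""
    if relation not in RELATIONS:
        raise ValueError(f"unknown relation: {relation}")
    return (head, relation, tail)


# Hypothetical identifier, used purely as an example.
vuln = Vulnerability(cnvd_id="CNVD-0000-00000", title="example vulnerability")
triple = make_triple(vuln, "influence", Product(name="Apache Log4j"))
```

Keeping the relation names in one tuple means that extending the schema, as the text suggests, only requires adding a class or a relation name.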


2.2 Data Layer Construction 


The vulnerability knowledge graph data layer consists of three steps: data collection, 
knowledge extraction, and knowledge fusion. 


2.2.1 Data Collection 


Vulnerability, company, and product data are obtained from the unstructured text of
the China National Vulnerability Database (CNVD) and the semi-structured text of CVE
(Common Vulnerabilities and Exposures) [10]. Data for the other two entities, events
and victims, can be collected in a compliant manner by organizations according to their
own circumstances, for example when they manage vulnerabilities centrally for the unit
and its subordinate units, or act as vulnerability managers.


2.2.2 Knowledge Extraction 


Knowledge extraction automatically obtains structured information such as entities,
relationships, and entity attributes from heterogeneous data, including semi-structured
and unstructured data. According to the characteristics of vulnerability intelligence
text, this paper annotates the text with the BIOES tagging scheme [11] and then performs
three main operations: entity extraction, attribute extraction, and relation
extraction. They are introduced as follows:


1) Entity and attribute extraction. Entity extraction, namely named entity recognition
(NER), refers to the automatic recognition of named entities in text datasets. At
present, the main technical approaches to named entity recognition are: rule-based and
dictionary-based methods, which rely on manually constructed rule templates with
pattern and string matching as the main means; statistical methods, including the
Hidden Markov Model (HMM), Maximum Entropy Model (MEM), Support Vector Machine (SVM),
and Conditional Random Field (CRF); and neural network methods, whose main models are
NN/CNN-CRF, RNN-CRF, and LSTM-CRF. The goal of attribute extraction is to collect the
attribute information of a specific entity from different information sources. For
example, for a specific vulnerability, attributes such as name and affected product can
be obtained from public information on the Internet. For entity and attribute
extraction, this paper adopts the BLSTM-CRF model (Bidirectional Long Short-Term Memory
Network - Conditional Random Field) [12], which is currently among the more effective
models in the field of security vulnerabilities. Fig. 2 shows the model, taking the
product entity (Apache Log4j) as an example.


[Figure: model layers Embedding, LSTM, Softmax, and CRF, with example output tags
Apache/B_name and Log4j/B_name]

Fig. 2. The structure of BLSTM-CRF model


2) Relation extraction. Entity and attribute extraction turns the vulnerability
intelligence text into a series of discrete named entities. Obtaining further semantic
information requires relation extraction: extracting the interrelationships between
entities from the related text and connecting the entities through these relationships
to form a networked knowledge structure. Unlike a social graph of people, the
vulnerability knowledge graph has relatively few and simple relationships. For example,
vulnerability A “raises” event B. Because the relationships defined in the schema layer
are easy to distinguish in text data such as vulnerability reports, this paper uses
rule matching: candidate pairs of recognized entities are selected automatically
according to their categories and the relationship definitions in the schema layer, and
the results are fine-tuned afterwards. Entity pairs that fit a pattern rule are
assigned a relationship according to the trigger word between them. Samples of the
designed rules are shown in Table 1.


Table 1. Samples of trigger word rules


No  Rules                                  Relation
1   Vulnerability A influences product B   Influence
2   Product B belongs to company C         Belong to
3   Vulnerability D raises event E         Raise
4   Event E belongs to victim F            Belong to
5   Victim F uses product G                Use
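
The two extraction steps described above can be sketched together: BIOES tags are first decoded into typed entities, and trigger word rules in the spirit of Table 1 are then applied to pairs of entities. The sketch below is a simplified stand-in, not the paper's implementation: the tags are supplied by hand rather than predicted by a BLSTM-CRF model, and the tag format (`B-product`, `S-vulnerability`), the function names, and the example sentence are all assumptions.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)

# (head type, trigger word, tail type, relation), following Table 1.
RULES = [
    ("vulnerability", "influences", "product", "influence"),
    ("product", "belongs to", "company", "belong to"),
    ("vulnerability", "raises", "event", "raise"),
    ("event", "belongs to", "victim", "belong to"),
    ("victim", "uses", "product", "use"),
]


def decode_bioes(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Decode BIOES tags (B=begin, I=inside, O=outside, E=end, S=single)."""
    entities: List[Entity] = []
    current: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current, current_type = [], None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":
            entities.append((token, etype))
            current, current_type = [], None
        elif prefix == "B":
            current, current_type = [token], etype
        elif prefix in ("I", "E") and current_type == etype:
            current.append(token)
            if prefix == "E":
                entities.append((" ".join(current), etype))
                current, current_type = [], None
        else:
            # Malformed tag sequence: drop the partial entity.
            current, current_type = [], None
    return entities


def extract_relations(text: str, entities: List[Entity]) -> List[Tuple[str, str, str]]:
    """Emit (head, relation, tail) when a rule's trigger word lies between two entities."""
    triples = []
    for head, head_type in entities:
        for tail, tail_type in entities:
            h, t = text.find(head), text.find(tail)
            if h < 0 or t < 0 or h >= t:
                continue
            between = text[h + len(head):t]
            for rule_head, trigger, rule_tail, relation in RULES:
                if (head_type, tail_type) == (rule_head, rule_tail) and trigger in between:
                    triples.append((head, relation, tail))
    return triples


# Hypothetical sentence and hand-written tags.
tokens = ["CNVD-0000-00000", "influences", "Apache", "Log4j"]
tags = ["S-vulnerability", "O", "B-product", "E-product"]
entities = decode_bioes(tokens, tags)
triples = extract_relations(" ".join(tokens), entities)
```

Because the schema has only four relationships, a small rule table like this stays readable; a larger relation inventory would probably call for a learned relation classifier instead.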


2.2.3 Knowledge Fusion 


After data collection and knowledge extraction, entities, relationships, and entity
attribute information have been obtained from the original unstructured and
semi-structured vulnerability intelligence data. However, the relationships among the
multiple information sources are flat and lack hierarchy and logic, and the knowledge
still contains a great deal of redundancy and misinformation. Knowledge fusion solves
this problem: through entity disambiguation and coreference resolution, it integrates
the vulnerability knowledge. For example, the Chinese-language name of the company
Apple and the name “Apple” are synonymous entities and need to be merged. After
knowledge fusion, noise and redundancy are removed from the data, and the quality of
the vulnerability knowledge is improved.
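
A very small part of knowledge fusion, merging synonymous entity names and dropping duplicate triples, can be sketched as follows. The alias table and function names are hypothetical, and a dictionary lookup is used here as a simple stand-in for the entity disambiguation and coreference resolution that the paper refers to.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical alias table: each known surface form maps to one canonical name.
ALIASES: Dict[str, str] = {
    "apple": "Apple",
    "apple inc.": "Apple",
    "apache log4j": "Apache Log4j",
    "log4j": "Apache Log4j",
}


def canonical(name: str) -> str:
    """Return the canonical entity name, or the whitespace-cleaned input if unknown."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned.lower(), cleaned)


def fuse(triples: Iterable[Triple]) -> List[Triple]:
    """Normalise entity names and drop duplicate triples, keeping first-seen order."""
    seen: Set[Triple] = set()
    fused: List[Triple] = []
    for head, relation, tail in triples:
        item = (canonical(head), relation, canonical(tail))
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused


raw = [
    ("Log4j", "belong to", "Apache"),
    ("Apache  Log4j", "belong to", "Apache"),  # same fact, different surface form
]
fused = fuse(raw)
```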


3 Vulnerability Knowledge Graph Construction Results 


3.1 Experimental Environment 


The experimental environment of this paper is as follows: the operating system is
Windows 10; the CPU is an AMD Ryzen™ 7 5800H @ 3.2 GHz; the GPU is an RTX 3050 Ti
(4 GB); the memory is 64 GB; the Python version is 3.7; and the Neo4j version is 3.1.1.


3.2 Knowledge Graph Display 


Taking some generic vulnerability data and a small number of affected victims related
to Apache as an example (the entities are vulnerabilities, historical events, involved
victims, companies, and products; the relationships are the edges of a directed graph),
the constructed visual interface is shown in Fig. 3.

Fig. 3. Vulnerability knowledge graph


3.3 Application Analysis 


In vulnerability threat discovery and analysis, constructing the graph to correlate and
analyze vulnerability information makes it possible to mine hidden information and make
effective judgments. As Fig. 3 shows, the various types of entities are the nodes of
the graph, and the various types of relationships between entities are its edges.
Starting from a given entity, such as a victim that operates critical infrastructure,
one can find out which products of which companies the victim uses, and which security
events occurred at which times because of which vulnerabilities. If a 0-day
vulnerability later appears in that company’s products, it can reasonably be predicted
that the victim will be affected, and a warning can be issued in time, before a
possible Cybersecurity event, to avoid major losses. This information is often
unavailable from a single vulnerability report, whereas a knowledge graph can
organically connect numerous pieces of vulnerability information.
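
The kind of traversal described above can be sketched on a toy set of triples. The entity names are placeholders in the style of Table 1 and the function names are hypothetical. The paper stores its graph in Neo4j, whereas this sketch uses plain in-memory dictionaries to stay self-contained.

```python
from collections import defaultdict

# Toy triples with placeholder names; relation names follow the schema layer.
TRIPLES = [
    ("Victim F", "use", "Product G"),
    ("Product G", "belong to", "Company C"),
    ("Vulnerability A", "influence", "Product G"),
    ("Vulnerability A", "raise", "Event E"),
    ("Event E", "belong to", "Victim F"),
    ("Vulnerability D", "influence", "Product H"),
]


def build_index(triples):
    """Index triples by (head, relation) and by (tail, relation)."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for head, relation, tail in triples:
        outgoing[(head, relation)].append(tail)
        incoming[(tail, relation)].append(head)
    return outgoing, incoming


def exposure(victim, triples):
    """For one victim, list each product it uses with its company and known vulnerabilities."""
    outgoing, incoming = build_index(triples)
    report = {}
    for product in outgoing[(victim, "use")]:
        report[product] = {
            "companies": outgoing[(product, "belong to")],
            "vulnerabilities": incoming[(product, "influence")],
        }
    return report


def victims_at_risk(product, triples):
    """Victims to warn when a new vulnerability is reported for the given product."""
    _, incoming = build_index(triples)
    return incoming[(product, "use")]


report = exposure("Victim F", TRIPLES)
to_warn = victims_at_risk("Product G", TRIPLES)
```

The second query is the early-warning case from the text: once a new vulnerability is linked to a product, its users can be listed in a single lookup.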


4 Conclusion 


According to the characteristics of the vulnerability field, this paper first
integrates multi-source vulnerability intelligence data and designs a vulnerability
knowledge graph framework; it then uses a deep learning model to extract entities and
attributes, extracts relationships based on pattern rules, and constructs, checks, and
analyzes a vulnerability knowledge ontology; finally, it completes the multi-source
knowledge graph. In the future, by adding further vulnerability threat intelligence
data sources, a larger and more complete vulnerability knowledge graph can be formed,
which can provide more effective Cybersecurity decision support for information
workers.
Abstract. This paper analyzes the problems of the traditional TCP protocol in the
wireless network environment and proposes a scheme based on performance-enhancing
proxies, which is better suited to the actual situation of current wireless core
networks. TCP application optimization is employed to enhance congestion control.
Based on an automatic learning mechanism for network path features, this paper
proposes the dynamic algorithm ZetaTCP. In practice, the performance-enhancing proxy
based on ZetaTCP was verified and achieved good results in the LTE network.


Keywords: LTE - TCP protocol optimization - Congestion control - ZetaTCP 


1 Introduction 


As the main protocol for online data transmission, TCP has been widely used on the
mobile Internet, where it carries over 90% of traffic. Although LTE networks have
advanced and the mobile Internet has become notably faster, traditional TCP technology,
which was tailored for the wired network environment, cannot adapt to a wireless
network environment with relatively poor link quality and frequent changes in latency
and packet loss. Therefore, improving the transmission performance of TCP in wireless
networks and increasing the bandwidth utilization of wireless links are critical to
optimizing the wireless core network [1].


2 Features and Problems Analysis of Standard TCP 


TCP provides reliable transmission services over point-to-point connections and uses
sliding windows to control the transmission rate. The congestion control of standard
TCP contains several key technologies, including “slow start”, “congestion avoidance”,
“fast retransmission”, “quick recovery”, and “retransmission timeout” [2].

Standard TCP is very sensitive to packet loss: it halves or minimizes the congestion
window and then increases it slowly. In wired networks, packet loss usually indicates
network congestion, which standard TCP can control well and recover from quickly. In
the current wireless network environment, however, the defects of the standard TCP
protocol are increasingly salient:


2.1 Ineffective Congestion Judgment Mechanisms 


In a wired network with a low bit error rate, it is rational for TCP to assume that
packet loss is triggered by network congestion. In wireless networks, however, packet
loss can also be caused by sudden errors in wireless channels, mobile device handoffs,
channel attenuation, or changes in network topology. In this context, standard TCP
cannot accurately distinguish whether packet loss is caused by congestion, resulting in
congestion misjudgment.


2.2 Slow Congestion Recovery Mechanisms 


Once it detects packet loss, TCP triggers its congestion control response in three
steps. First, the unacknowledged packets are retransmitted, and the congestion window
and the transmission rate are reduced. Then, the congestion control mechanism is
activated, consisting of exponential back-off of the timeout clock and a reduction in
the slow start threshold. Finally, the congestion avoidance stage is entered to relieve
the congestion. If the packet loss actually results from channel errors or mobile
device handoffs, this congestion recovery mechanism causes an unnecessary drop in
throughput and longer latency.
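
To make the cost of this response concrete, the following sketch models a simplified, Reno-style sender. It is an illustration under stated assumptions, not the behaviour of any particular TCP stack: the window is counted in segments, and the class and function names, the floor of two segments, and the initial values are chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RenoState:
    cwnd: float = 1.0       # congestion window, in segments
    ssthresh: float = 64.0  # slow start threshold, in segments
    rto: float = 1.0        # retransmission timeout, in seconds


def on_ack(s: RenoState) -> None:
    """Grow the window: slow start below ssthresh, congestion avoidance above."""
    if s.cwnd < s.ssthresh:
        s.cwnd += 1.0           # +1 segment per ACK (doubles per round trip)
    else:
        s.cwnd += 1.0 / s.cwnd  # roughly +1 segment per round trip


def on_triple_dup_ack(s: RenoState) -> None:
    """Loss inferred from duplicate ACKs: halve the window."""
    s.ssthresh = max(s.cwnd / 2.0, 2.0)
    s.cwnd = s.ssthresh


def on_timeout(s: RenoState) -> None:
    """Loss inferred from a timeout: collapse the window and back off the timer."""
    s.ssthresh = max(s.cwnd / 2.0, 2.0)
    s.cwnd = 1.0
    s.rto *= 2.0


state = RenoState(cwnd=40.0, ssthresh=64.0)
on_triple_dup_ack(state)  # cwnd 40 -> 20, whatever the real cause of the loss
after_dup_ack = (state.cwnd, state.ssthresh)
on_timeout(state)         # cwnd 20 -> 1, ssthresh -> 10, rto doubled
```

Nothing in these handlers depends on why the packet was lost, which is exactly the problem on a wireless link: a loss caused by a radio error shrinks the window just as a congestion loss would.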


2.3 Inaccurate Packet Loss Judgment Mechanism 


The standard TCP stack determines packet loss by two methods. One is the number of
consecutive Dup-ACKs, and the other is the ACK timeout. When many packets are lost, the
Dup-ACK method often cannot be used, and the ACK timeout is what detects the loss and
triggers retransmission. In a modern network, packet losses are often bursty, and it is
common for multiple data packets to be lost simultaneously on a connection. Therefore,
standard TCP must often rely on timeouts for retransmission, which can lead to a
waiting state of several seconds or even ten seconds, causing long stalls in
transmission or even disconnection [3].


3 ZetaTCP Optimization 


According to surveys and analyses of the quality of mobile Internet access across
network operators, TCP traffic accounts for the vast majority of the traffic generated
by mobile users accessing mobile Internet applications. However, because delay and
packet loss change frequently in the wireless network environment, the transmission
efficiency of traditional standard TCP is often substantially reduced in this context
[4]. If the defects of TCP’s handling mechanisms under various wireless network
conditions can be corrected and the efficiency of TCP traffic transmission can be
enhanced, the user’s mobile Internet experience can be significantly improved [5].
3.1 Comparison and Improvement of TCP Optimization Technology


Most domestic and overseas TCP protocol optimization technologies apply static
algorithms, which use fixed congestion judgment and recovery mechanisms based on an
assumed Internet traffic model. As the Internet environment evolves, traffic
characteristics are increasingly complicated and difficult to predict. Under these
circumstances, such TCP optimization techniques are valid only in the specific network
scenarios where their assumptions hold. Moreover, as a transmission progresses, the
network path characteristics may change, and the optimization effect may become
unstable. Two common TCP optimization algorithms are presented below:

The Vegas TCP algorithm defines a state variable, Base RTT (basic round-trip delay),
whose theoretical value should be the “round-trip delay of the connection without
congestion.” With delay change as the congestion indicator, the Vegas algorithm is more
sensitive in judging network congestion, which decreases the packet loss rate of the
network and yields excellent average throughput when all flows in the network use the
Vegas algorithm. Nevertheless, in a network environment shared with packet loss-based
algorithms, a rapid rise in delay always occurs before packet loss. In this case, Vegas
shrinks CWND (congestion window) and reduces its transmission rate before the packet
loss-based algorithms do, making its overall performance inferior to theirs. Vegas TCP
therefore has relatively low transmission performance under shallow queue congestion
and frequent changes in wireless network delay [6].
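
The delay-based decision that makes Vegas back off early can be sketched as follows. This is a simplified, per-round-trip version of the classic Vegas rule; the function name, the threshold values `alpha` and `beta`, and the minimum window of two segments are assumptions made for the example.

```python
def vegas_update(cwnd: float, base_rtt: float, rtt: float,
                 alpha: float = 2.0, beta: float = 4.0) -> float:
    """Return the new congestion window (in segments) after one round trip.

    base_rtt is the smallest round-trip delay seen so far (no queueing);
    rtt is the most recent measurement. alpha and beta are illustrative thresholds.
    """
    expected = cwnd / base_rtt             # rate if nothing were queued
    actual = cwnd / rtt                    # rate actually achieved
    diff = (expected - actual) * base_rtt  # estimated segments sitting in queues
    if diff < alpha:
        return cwnd + 1.0                  # path looks underused: grow linearly
    if diff > beta:
        return max(cwnd - 1.0, 2.0)        # delay is rising: shrink before any loss
    return cwnd                            # in the target band: hold


grow = vegas_update(cwnd=20.0, base_rtt=0.05, rtt=0.05)    # no queueing delay
shrink = vegas_update(cwnd=20.0, base_rtt=0.05, rtt=0.10)  # RTT has doubled
hold = vegas_update(cwnd=16.0, base_rtt=1.0, rtt=1.25)     # moderate queueing
```

Because the window shrinks as soon as the measured RTT rises above Base RTT, delay variation that is unrelated to congestion, which is common on wireless links, is enough to slow a Vegas sender down.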

CUBIC TCP, an enhanced version of BIC-TCP, simplifies the window adjustment algorithm
of its predecessor. It uses a cubic function as the growth function of the congestion
window, so that the window grows only according to the time interval between two
consecutive congestion events. CUBIC is the default TCP congestion control algorithm of
the Linux kernel. CUBIC TCP has relatively low transmission performance under
non-congestion packet loss and deep queue congestion in wireless networks [6].
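
For reference, the cubic growth function can be sketched as below, using the form and default constants given in RFC 8312 (C = 0.4, multiplicative decrease factor 0.7). The function names are hypothetical, and the sketch covers only the window curve, not the TCP-friendly region or the other parts of a full implementation.

```python
def cubic_k(w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Time (s) the curve needs to climb back to w_max after a loss."""
    return (w_max * (1.0 - beta) / c) ** (1.0 / 3.0)


def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Congestion window (segments) t seconds after the last congestion event.

    w_max is the window size just before that event.
    """
    k = cubic_k(w_max, c, beta)
    return c * (t - k) ** 3 + w_max


w_start = cubic_window(0.0, w_max=100.0)                      # about 70: reduced to beta * w_max
w_plateau = cubic_window(cubic_k(100.0), w_max=100.0)         # back at w_max
w_probe = cubic_window(cubic_k(100.0) + 2.0, w_max=100.0)     # probing beyond w_max
```

Every loss, including a non-congestion loss on a radio link, resets the curve to 0.7 of the previous maximum, which is consistent with the weakness noted above.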

The Performance-enhancement Proxy Based Scheme is adopted to overcome the drawbacks of
the above algorithms and enable TCP to achieve high transmission efficiency in wireless
networks with long latency and frequent link errors [4]. Because the scheme splits the
original TCP connection into segments, it is also known as TCP segmentation (see
Fig. 1).

The Performance-enhancement Proxy Based Scheme follows the idea that local problems
should be solved locally. By deploying the proxy, the TCP connection between the server
and the wireless mobile end is split into two sections at a node in the middle: one
section connects to the server sending end on the fixed network, and the other connects
to the mobile receiving end on the wireless network. This shields the server sending
end from the influence of the wireless environment, so that random packet loss in the
wireless network does not cause the server sending end to activate its congestion
control algorithm unnecessarily. By virtue of the improved TCP deployed on the enhanced
proxy, the performance of TCP in wireless networks is strengthened, and the data
transmission rate to mobile ends is raised [7]. The scheme requires no modification to
the TCP protocol stack of either the server sending end or the wireless mobile
receiving end, which makes it feasible to implement at this stage.
Fig. 1. Schematic diagram of performance-enhancement proxy based scheme


3.2 ZetaTCP Optimality Principle 


ZetaTCP is a dynamic self-learning algorithm based on network path features, deployed
here with the Performance-enhancement Proxy Based Scheme. It observes and analyzes the
real-time network features of each TCP connection and adjusts the algorithm at any time
according to the learned network characteristics. By doing so, it can judge the degree
of congestion more accurately and detect packet loss more promptly, thereby handling
congestion more appropriately and retransmitting lost packets more swiftly. By design,
it overcomes the inability of static algorithms to adapt to changes in network path
characteristics and keeps the acceleration effect valid across network environments,
even when network delay and packet loss change frequently.

Besides applying the two approaches of standard TCP described above, ZetaTCP takes both
packet loss and delay changes into consideration and introduces a self-learning dynamic
algorithm mechanism based on the network path characteristics of each TCP connection,
making the congestion judgment more accurate and timely. The dynamic learning mechanism
determines the network path characteristics of each specific connection during
transmission. These characteristics include the end-to-end delay and how it changes,
the arrival interval of feedback packets (ACKs) from the receiving end and its
variation, the degree of packet reordering and how it changes, delay jitter possibly
caused by deep packet inspection in security equipment, and random packet loss induced
by various factors. ZetaTCP tracks these features in real time, builds a comprehensive
picture of them, and deduces the precursor signals that indicate congestion and packet
loss on the network path of this specific TCP connection. Based on these learning
results and the transmission rate, it determines the degree of congestion, a congestion
recovery mechanism appropriate to the available bandwidth of the current path, and how
packet loss is judged and recovered.

ZetaTCP is implemented by an automatic learning state machine (Learning State-Machine),
as shown in Fig. 2.

Fig. 2. ZetaTCP automatic learning state machine


Each automatic learning state machine corresponds to one TCP connection, records the
network path characteristics of that connection, and dynamically determines the
appropriate congestion judgment mechanism, recovery mechanism, and packet loss judgment
mechanism. Specifically, connection management extracts the external features of the
network path directly and feeds them to the state machine. The learning results
accumulated by the state machine are used by the packet loss monitoring, congestion
control, exception handling, and delay monitoring modules to adjust the transmission
behavior of the corresponding TCP connection. Dynamic feedback is returned to the
automatic learning state machine through the exception handling and congestion control
modules to further optimize network path learning.


3.3 Implementation Algorithm of ZetaTCP Optimization 


On Linux, Netfilter is used for packet interception. Depending on the deployment
scenario, Netfilter hooks can be registered at the Ethernet bridge level to implement
transparent bridge mode, or at the INET level to implement routing mode. A pair of hook
points can be mounted to the LAN and WAN sides of the engine on NF_INET_POST_ROUTING
and NF_INET_PRE_ROUTING respectively, as shown in Fig. 3.

Fig. 3. Implementation of Hook point of ZetaTCP


ZetaTCP’s congestion control algorithm is shown in Fig. 4.

Fig. 4. Process flow of the congestion control algorithm of ZetaTCP


For the data packet with the highest sequence number acknowledged by a received ACK,
the actual instant throughput rate is calculated as $BC = FS / (T - TS)$, where $T$
denotes the current time, $TS$ is the sending time of the data packet with the highest
sequence number, and $FS$ is the total amount of data that had been sent but not yet
acknowledged by ACK at time $TS$. $TS$ and $FS$ are recorded when the data packet with
the highest sequence number is sent.

The smooth throughput rate is determined according to $B = (1 - a) \cdot B' + a \cdot BC$,
where $a$ is a constant parameter, $BC$ is the actual instant throughput rate, and $B'$
is the last calculated smooth throughput rate.
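
The two formulas can be checked with a short numerical sketch. The function names, units, and sample values are assumptions, and the value of the constant `a` is chosen only for illustration, since the text does not give one.

```python
def instant_throughput(fs: float, ts: float, t: float) -> float:
    """BC = FS / (T - TS): unacknowledged bytes at send time over the elapsed time (s)."""
    elapsed = t - ts
    if elapsed <= 0:
        raise ValueError("current time must be later than the sending time")
    return fs / elapsed


def smooth_throughput(prev_b: float, bc: float, a: float = 0.125) -> float:
    """B = (1 - a) * B' + a * BC: exponentially weighted moving average."""
    return (1.0 - a) * prev_b + a * bc


# Hypothetical sample: 150 000 bytes were in flight when the packet was sent at
# TS = 10.0 s, and its ACK is processed at T = 10.5 s.
bc = instant_throughput(fs=150_000, ts=10.0, t=10.5)  # 300 000 bytes/s
b = smooth_throughput(prev_b=200_000, bc=bc, a=0.25)  # 225 000 bytes/s
```

A larger `a` makes the smoothed rate follow each new sample more closely, while a smaller `a` damps short-lived fluctuations.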

CWND growth modes fall into three categories: exponential growth, linear growth, and
stop. If the smooth throughput rate has increased relative to the previous smooth
throughput rate, the CWND growth mode is set to exponential growth. If the smooth
throughput rate declines continuously for a predetermined number of times, and the
total drop in smooth throughput rate is not less than the throughput rate drop
threshold, it is further judged whether the current smooth round-trip delay SRTT is
less than or equal to $\eta \cdot RTTMIN$, where RTTMIN is the smallest round-trip delay
and $\eta$ is a constant parameter. If yes, the CWND growth mode is set to linear
growth; if not, it is set to stop.
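
The mode selection described above can be sketched as a small decision function. This is one reading of the text rather than the ZetaTCP implementation: the function and parameter names, the default thresholds, and the choice to keep the current mode when neither condition applies are all assumptions.

```python
from enum import Enum


class GrowthMode(Enum):
    EXPONENTIAL = "exponential growth"
    LINEAR = "linear growth"
    STOP = "stop"


def choose_growth_mode(current: GrowthMode, b: float, prev_b: float,
                       drops_in_a_row: int, total_drop: float,
                       srtt: float, rtt_min: float,
                       max_drops: int = 3, drop_threshold: float = 0.0,
                       eta: float = 1.5) -> GrowthMode:
    """Pick the CWND growth mode from smoothed throughput and delay.

    b / prev_b: current and previous smooth throughput rate.
    drops_in_a_row, total_drop: length and size of the current run of declines.
    max_drops, drop_threshold, eta: illustrative constants (assumed values).
    """
    if b > prev_b:
        return GrowthMode.EXPONENTIAL
    if drops_in_a_row >= max_drops and total_drop >= drop_threshold:
        # Throughput keeps falling: use delay to decide how careful to be.
        return GrowthMode.LINEAR if srtt <= eta * rtt_min else GrowthMode.STOP
    return current  # assumption: otherwise keep the current mode


rising = choose_growth_mode(GrowthMode.LINEAR, b=120.0, prev_b=100.0,
                            drops_in_a_row=0, total_drop=0.0,
                            srtt=0.06, rtt_min=0.05)
falling_low_delay = choose_growth_mode(GrowthMode.EXPONENTIAL, b=80.0, prev_b=100.0,
                                       drops_in_a_row=3, total_drop=30.0,
                                       srtt=0.06, rtt_min=0.05, drop_threshold=20.0)
falling_high_delay = choose_growth_mode(GrowthMode.EXPONENTIAL, b=80.0, prev_b=100.0,
                                        drops_in_a_row=3, total_drop=30.0,
                                        srtt=0.10, rtt_min=0.05, drop_threshold=20.0)
```

Throughput decides whether the window may keep growing, and delay decides how cautiously, so a loss on its own never forces the window down.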

ZetaTCP can, through the foregoing algorithm, obtain the real-time optimal CWND 
value, thereby maximizing network throughput and preventing congestion. 


4 Application of ZetaTCP Optimization Technology in Mobile 
Internet 


Packet losses in wireless networks are often attributed to signal loss, interference,
and other causes, and the packet loss rate is greater than in wired networks [8].
Consequently, standard TCP usually fails to detect these packet losses quickly, which
brings about low transmission efficiency, unstable transmission quality, high
unpredictability, and poor user experience. By contrast, ZetaTCP can quickly predict
packet loss and recover in time in a wireless network environment, making transmission
more stable and faster and thereby considerably improving the user experience.

In order to verify the actual effect of the ZetaTCP optimization technology in the 
wireless core network, the Performance-enhancement Proxy Based Scheme and the 
LotWan acceleration system using ZetaTCP as the proxy node are introduced to optimize 
the data transmission of the wireless core network and evaluate its optimization effect. 


4.1 ZetaTCP Optimization Deployment Scheme 


As indicated in Fig. 5, the ZetaTCP acceleration device is deployed outside the SGi
port of the PDN-GW, connected transparently in series either between the PDN-GW and the
firewall or on the Internet side of the firewall. The acceleration device works as a
TCP proxy, so that acceleration covers the whole wireless network transmission path.

Fig. 5. ZetaTCP acceleration device deployment scheme


4.2 ZetaTCP Optimization Application Results 


The implementation environment selected here covers three kinds of areas with poor
wireless network conditions: medium-low field strength coverage, hotspot, and busy-hour
regions. The data show that the congestion control algorithm adopted by ZetaTCP
achieves faster transmission speeds in all three kinds of areas. The average results of
the application assessments are listed in Tables 1, 2 and 3.


Table 1. Acceleration effect of web browsing services


Web browsing             Delay after    Delay without  Delay      Throughput     Throughput     Throughput
service                  acceleration   acceleration   reduction  after          without        improvement
                         (s)            (s)                       acceleration   acceleration
                                                                  (KB/s)         (KB/s)
Sina                     12.66          18.62          31.99%     425.55         261.95         62.45%
Sohu                     5.57           11.77          52.67%     109.86         65.70          67.22%
NetEase                  1.44           26.68          45.86%     117.84         67.40          74.83%
Mobile phone Tencent     7.93           14.10          43.71%     94.85          67.81          39.89%
Mobile games             6.37           13.41          52.45%     123.99         82.96          49.46%
People.cn                18.46          33.90          45.54%     52.91          30.87          71.37%
STO express              5.47           9.73           43.75%     133.51         89.56          49.06%
Harvest fund             17.97          29.61          39.30%     112.24         64.48          74.07%
CCB                      6.51           11.10          41.37%     62.89          40.45          55.47%
Xinhuanet                5.39           10.12          46.68%     154.48         95.84          61.18%
Overall (Average value)  -              -              44.33%     -              -              60.50%



Table 2. Acceleration effect of file download services


File download                                Rate after            Rate without          Throughput
                                             acceleration (KB/s)   acceleration (KB/s)   improvement
Client terminal of QQ - 23.5M                2851.21               1410.61               102.13%
Client terminal of MM shopping mall - 12.9M  2362.47               1159.66               103.72%
Overall (Average value)                      -                     -                     102.92%


Table 3. Acceleration effect of video download services


Video download           Rate after            Rate without          Throughput
                         acceleration (KB/s)   acceleration (KB/s)   improvement
Sohu video               2365.09               1288.95               83.49%
Youku tudou              2186.83               1127.95               93.88%
Overall (Average value)  -                     -                     88.68%
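
The percentage columns in the tables appear to follow the usual relative-change definitions, which the sketch below applies to the rows of Tables 2 and 3. The function names are hypothetical and the formulas are an assumption about how the columns were computed. They reproduce the published throughput figures, while the delay percentages in Table 1 can differ in the second decimal place, presumably because of averaging or rounding.

```python
def throughput_improvement(after: float, before: float) -> float:
    """Relative increase in rate, in percent."""
    return (after / before - 1.0) * 100.0


def delay_reduction(after: float, before: float) -> float:
    """Relative decrease in delay, in percent."""
    return (before - after) / before * 100.0


# (rate after acceleration, rate without acceleration) in KB/s, from Tables 2 and 3.
rows = {
    "Client terminal of QQ": (2851.21, 1410.61),
    "Client terminal of MM shopping mall": (2362.47, 1159.66),
    "Sohu video": (2365.09, 1288.95),
    "Youku tudou": (2186.83, 1127.95),
}

improvements = {name: round(throughput_improvement(after, before), 2)
                for name, (after, before) in rows.items()}
```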


The above results are consistent with the earlier analysis: because standard TCP judges
wireless packet losses late, its transmission efficiency is often low and its
transmission quality unstable and hard to predict, which seriously affects the user
experience. With ZetaTCP acceleration in the wireless network environment, the network
characteristics of each connection are accumulated through dynamic learning, making the
ZetaTCP congestion control algorithm more accurate. ZetaTCP can also predict packet
loss very quickly and recover in time, making transmission more stable and faster and
thereby significantly improving the user’s experience.


5 Conclusion 


TCP optimization is an essential approach for telecom operators to optimize the wireless
core network: it remarkably boosts the Internet access rate of mobile phone users, enhances
user perception, and strengthens traffic management capability.

This paper proposes a scheme for improving ZetaTCP performance in the wireless network
environment by combining it with an enhanced TCP congestion control mechanism and applying
it in the production environment of the current network. The experimental results
demonstrate that ZetaTCP helps TCP users obtain the appropriate bandwidth as defined by the
flow specification. It can eliminate the unfairness caused by different RTTs when
congestion occurs. It maintains the end-to-end semantics of TCP and takes corresponding
measures after distinguishing the types of network packet loss, thereby improving the
transmission performance of TCP.


Abstract. Accurate and stable traffic sign detection is a key technology for achieving
L3 driving automation, and its performance has been significantly improved by the
development of deep learning in recent years. However, current traffic sign detectors
resist adversarial attacks poorly and often lack even basic defense capability. To solve
this critical issue, this paper proposes IYOLO-TS, a defense model against adversarial
patch attacks. The main innovation is to simulate traffic signs being partially damaged,
obscured or maliciously modified in the real world by training attack patches, then to add
attacked classes corresponding to the original detection categories to the last layer of
YOLOv2, and finally to use the trained attack patches for adversarial training of the
detection model. The attack patch is obtained by first using the RP2 algorithm to attack
the detection model and then training on a blank patch. To verify the defense
effectiveness of the proposed IYOLO-TS model, we constructed a patch dataset, LISA-Mask,
containing 33,000 images generated with 50 different masks, and built the training dataset
by combining the LISA and LISA-Mask datasets. The experimental results show that the mAP
of the proposed IYOLO-TS reaches 98.12%. Compared with YOLOv2, it improves the defense
ability against patch attacks while retaining real-time detection ability. The proposed
method therefore has strong practicality and achieves a tradeoff between design complexity
and efficiency.


Keywords: Traffic sign detection · Adversarial patch attack · Deep learning


1 Introduction 


Traffic sign detection is a key technology that is continuously updated and iterated in
vision-based advanced driver assistance systems. Its purpose is to establish accurate,
real-time and safe traffic sign recognition capabilities for complex and dynamic real
roads [1]. The most widely used technology is target detection based on Deep Neural
Networks (DNN) [2]. However, many recent studies have shown that DNN models are not
reliably secure: they are susceptible to adversarial examples, which mislead the
classifier into producing incorrect predictions [3-5]. Currently, adversarial patch
attacks in the physical world are considered a very effective means of attacking object
detection models, and have achieved remarkable results in image classification [6], face
recognition [7], object detection, and other fields [8-10]. To deal with the security
threats caused by patch attacks, a growing number of researchers have begun to study
defense methods. However, current research mainly focuses on image classification, and
there are few reports on traffic sign detection. In addition, traditional image
pre-processing methods, such as image denoising [11], local gradient smoothing [12], and
partial occlusion [13], reduce the detection accuracy on the original samples, and most of
them are designed to operate in the digital space and are ineffective in the physical
world.

The YOLO (You Only Look Once) series are one-stage object detectors that directly output
bounding boxes and categories. Compared with RCNN (Region-Convolutional Neural Networks),
Faster-RCNN and other two-stage networks, YOLO has a lighter structure, fewer parameters,
and faster speed. Therefore, it is more suitable for application research in the field of
automatic driving, which requires both real-time performance and accuracy [14]. Compared
with v3-v5, YOLOv2 requires less computation in forward inference [15-18], and can
maintain a relatively high mAP (mean Average Precision) on the COCO dataset test under
the same input scale. In addition, in automatic driving, object detection models are
mostly deployed on edge devices for inference, where model storage space and computing
resources are limited [19]. YOLOv2 mainly consists of convolutional layers and softmax,
which makes it easier to implement on mobile devices, and its inference can also be
accelerated by small graphics cards. Therefore, the interesting and challenging question
addressed here is how to extend YOLOv2 to traffic sign detection and achieve a stable
defense capability.

To solve the above problems, we propose IYOLO-TS (Improved YOLOv2 on Traffic Signs), an
adversarial patch defense model for traffic sign detection. The main contributions can be
summarized as follows: (1) We extend the research of patch attack defense to the field of
traffic sign detection and propose a practical defense model, IYOLO-TS. (2) We improve the
last layer of the YOLOv2 model by adding 11 attacked classes, and optimize its structure
to ensure high detection performance for normal traffic signs. (3) To achieve high
robustness and a more realistic style against perturbations, we adopt the RP2 algorithm
[8] to attack YOLOv2 and pioneer the development of a patch dataset named LISA-Mask.


2 Improved YOLOv2 on Traffic Signs Detection Model


2.1 Framework Design of IYOLO-TS


Figure 1 provides an overview of IYOLO-TS. In terms of network structure, IYOLO-TS adds
11 attacked categories to the last softmax layer. As a result, IYOLO-TS is able to detect
attacked targets while still assigning them to their true classes, as shown in the right
part of Fig. 1.

Fig. 1. Framework of IYOLO-TS. We sample from each category of LISA and LISA-Mask to
train IYOLO-TS. IYOLO-TS retains the network structure of YOLOv2 except for the final
softmax layer, to which 11 attacked categories are added. The right part of the figure
shows the attacked traffic sign detection results of IYOLO-TS.


The basic idea of YOLOv2 is to represent the output of the feature map as the center,
width and height of the bounding box, together with the confidence and category. YOLOv2
divides the input image x into N preselected areas, and each area predicts M anchor
boxes. Assuming that there are n classes to be identified (n is 11 for the LISA and
LISA-Mask datasets), each anchor box can be written as an (n + 5)-dimensional vector. The
result of the feature map for each anchor box can be expressed as shown below:


$(X, Y, W, H, P_{obj}, P_{cls_1}, \ldots, P_{cls_n})$    (1)


where $X$, $Y$, $W$, $H$ are the center and size of the bounding box, $P_{obj}$ is the
confidence score indicating the probability that the bounding box contains a target, and
$P_{cls_i}$ is the score of class $i$. The anchor boxes are then arranged in order, so
each preselected area outputs a vector of dimension M(n + 5), and the final output of
YOLOv2 is a vector of dimension NM(n + 5). IYOLO-TS inherits the form of the YOLOv2 loss
function and adds a loss term for the attacked class scores. Because we add 11 attacked
categories to the last softmax layer of YOLOv2, the length of each anchor box vector
becomes (n + 5 + 11), and the final output becomes a vector of dimension
NM(n + 5 + 11). This gives IYOLO-TS two advantages: the detection speed inherited from
YOLOv2 meets the time-sensitive requirements of defending against physical world attacks,
and the model can also be used to detect attacks.
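
The effect of the 11 extra classes on the output size can be checked with a few lines of
arithmetic. This is a small sketch, not the authors' code: the 13 x 13 grid and 5 anchors
per cell are common YOLOv2 defaults assumed here for illustration, and only n = 11 comes
from the text.

```python
def yolo_output_length(num_cells: int, anchors_per_cell: int,
                       num_classes: int, num_attacked: int = 0) -> int:
    """Length of the flattened output vector: N * M * (n + 5 + attacked classes)."""
    # 4 box terms + 1 confidence score + one score per (clean or attacked) class
    per_anchor = num_classes + 5 + num_attacked
    return num_cells * anchors_per_cell * per_anchor


# Assumed values: N = 13 * 13 cells and M = 5 anchors are not given in the paper.
N, M, n = 13 * 13, 5, 11
baseline = yolo_output_length(N, M, n)                   # YOLOv2: NM(n + 5)
defended = yolo_output_length(N, M, n, num_attacked=11)  # IYOLO-TS: NM(n + 5 + 11)
print(baseline, defended)
```

Only the per-anchor vector grows, from 16 to 27 entries in this setting, so the rest of the
network is untouched.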


2.2 RP2-Based Attacking Process 


To achieve high robustness and a more realistic style against perturbations, we use the
method in [8] to attack the YOLOv2 detector. To generate visual adversarial perturbations
that are robust under different physical conditions, the RP2 algorithm is first derived
without considering physical conditions, starting from the optimization method for
generating a perturbation for a single image x. The algorithm is then updated to account
for continuous changes in the distance and angle of the camera relative to the road sign.
The constrained optimization problem of RP2 is expressed as below:

$\arg\min_{\delta} \lambda \|\delta\|_p + J(f_\theta(x + \delta), y^*)$    (2)


where $J(\cdot)$ is the loss function that measures the difference between the prediction
of the model and the target class $y^*$, $x$ is the input, $\delta$ denotes the
perturbation of input $x$, $f_\theta(\cdot)$ denotes the target classifier, and $\lambda$
is the hyperparameter that controls the regularization of the distortion. The distance
function is specified as $\|\delta\|_p$, the p-norm of $\delta$. To better capture the
effects of changing physical conditions, experimental samples containing random noise are
generated and added to the algorithm iterations. To ensure that the perturbation is
applied only to the surface of the target object, a mask is introduced to limit the
physical region of the perturbation. The final robust, spatially constrained perturbation
is optimized as:


$\arg\min_{\delta} \lambda \|M_x \cdot \delta\|_p + NPS + \mathbb{E}_{x_i \sim X^V} J(f_\theta(x_i + T(M_x \cdot \delta)), y^*)$    (3)


where the matrix $M_x$ is the representation of the mask, NPS is the non-printability
score, and the function $T(\cdot)$ is the alignment function that maps transformations of
the object onto the perturbation. Since all perturbation values must be reproducible in
the physical world and printers introduce some color reproduction errors [20], RP2 adds
the NPS term to the objective function to model these errors. During an attack, forged
patches generated under the constraints of different masks can simulate common acts of
vandalism that most people ignore. Such attacks in the physical world are highly
disruptive to traffic sign detectors, so it is imperative to develop appropriate defense
strategies.
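
To make the structure of the masked objective concrete, the sketch below evaluates a
simplified version of it with NumPy. It is an illustration under stated assumptions, not
the RP2 implementation: the classifier is a toy linear softmax model, the NPS term and the
alignment function T are omitted, and all names and sizes are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def masked_attack_objective(delta, mask, images, target, weights, lam=0.1, p=2):
    """lam * ||M * delta||_p plus the mean cross-entropy toward the target class.

    Simplifications (assumptions): toy linear classifier, no NPS term, no T().
    """
    masked = mask * delta  # perturbation restricted to the mask region
    reg = lam * np.linalg.norm(masked.ravel(), ord=p)
    losses = []
    for x in images:  # the expectation is approximated by a sample mean
        probs = softmax(weights @ (x + masked).ravel())
        losses.append(-np.log(probs[target]))
    return reg + float(np.mean(losses))


rng = np.random.default_rng(0)
images = [rng.random((4, 4)) for _ in range(3)]  # stand-ins for views of one sign
weights = rng.normal(size=(3, 16))               # 3 hypothetical classes
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                             # only the center may be perturbed
delta = rng.normal(size=(4, 4))
value = masked_attack_objective(delta, mask, images, target=2, weights=weights)
print(value)
```

Because the perturbation is multiplied by the mask before it is added to the image, pixels
outside the mask never change, which is what confines the patch to the sign surface.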


2.3 Generation of the LISA-Mask Dataset


To make IYOLO-TS more generalizable and effective in defending against various patch
attacks, we generate 50 different masks and construct a new dataset named LISA-Mask to
help train IYOLO-TS.

While generating attack patches, we found that the location of a patch affects the
effectiveness of the attack, and that each mask produces a different attack effect. In
addition, to simulate random attack scenarios as realistically as possible, 50 different
masks are produced in this paper by limiting the size, distance, number and shape of the
masked regions. Unlike other target detection datasets, where the whole object area can be
taken as the region of interest for the attack, the masks in this paper limit the size of
the region so that they avoid obscuring the whole pattern of a traffic sign.

The success rate of the attack can be expressed as follows:


$\frac{\sum_{c \in C} \mathbb{1}\{f_\theta(A(c^{d,g})) = y^* \wedge f_\theta(c^{d,g}) = y\}}{\sum_{c \in C} \mathbb{1}\{f_\theta(c^{d,g}) = y\}}$    (4)


where $A(c^{d,g})$ represents the images from the original image set $C$ whose
classification results are incorrect after the attack, $c^{d,g}$ represents an image taken
from distance $d$ and angle $g$, $y$ is the actual class label of the target, and $y^*$ is
the detection result of the target after the attack.

[Figure 2 panels: mask1, 70.5%; mask2, 63.6%; mask3, 63.6%; mask4, 61.4%; mask5, 72.7%;
mask6, 95.5%; mask7, 84.1%; mask8, 52.3%]

Fig. 2. Some of the masks and their attack success rates.


Figure 2 shows some of the generated masks and their attack success rates. Different
kinds of masks lead to different degrees of reduction in YOLO's inference results, i.e.,
physical attacks on traffic signs can be simulated to some extent.
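
The success rate in Eq. (4) counts only images that the model classifies correctly before
the attack. A minimal sketch of that computation follows; it reads y* as a single target
label, which is an assumption, and the predictions are made-up values for illustration.

```python
def attack_success_rate(clean_preds, attacked_preds, true_labels, target_label):
    """Share of correctly classified clean images whose attacked version
    is classified as target_label (y* read as a target class: an assumption)."""
    eligible = [i for i, (pred, y) in enumerate(zip(clean_preds, true_labels))
                if pred == y]
    if not eligible:
        return 0.0
    hits = sum(1 for i in eligible if attacked_preds[i] == target_label)
    return hits / len(eligible)


# Hypothetical predictions for five views of a "stop" sign.
clean = ["stop", "stop", "merge", "stop", "stop"]
attacked = ["limit45", "stop", "limit45", "limit45", "limit45"]
truth = ["stop"] * 5
rate = attack_success_rate(clean, attacked, truth, "limit45")
print(rate)
```

The third view is excluded from the denominator because the clean image was already
misclassified, so a wrong prediction there cannot be credited to the patch.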
Fig. 3. The generation process of the LISA-Mask dataset.


Figure 3 shows the generation process of the LISA-Mask dataset. First, YOLOv2 is trained
on the LISA training set and named Model_0. Then 50 different masks are generated using
the aforementioned method, and Model_0 is attacked with each mask based on the method in
[8]: the difference between the detection results and the true labels is added to the
loss function, and the attack patches are updated by back-propagation training. The
generated patches are applied to LISA to obtain images with patch attacks, which form the
LISA-Mask dataset. The produced dataset contains 11 categories of traffic sign images,
each containing 3,000 images obtained from the 50 attacks, for a total of 33,000 images.


3 Experiments and Results 


3.1 Test Bench Setup 


To evaluate our proposed work, we constructed the experimental data according to the
structure in Fig. 4. First, the LISA-Mask and LISA datasets are merged. There are 11
types of targets, and each type is divided into clean data and attacked data. Then, to
keep the data balanced in training, three enhancement methods (contrast, brightness and
sharpness changes) are applied to categories with fewer than 100 pictures in the LISA
dataset. We do not recommend cropping, mirroring, rotation or similar enhancement
methods, because these situations are uncommon in driving detection tasks. Finally, we
randomly selected two hundred images from each category of data to construct the
experimental dataset, which is split into an 80% training set and a 20% test set.


[Figure 4: for each class (merge, addlane, turnright, stopahead, and the others), 200
attacked images from LISA-MASK and 200 clean images from LISA are randomly selected and
merged into the experiment dataset, which is split into an 80% train set and a 20% test
set.]

Fig. 4. Construction structure of the experimental dataset.

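
The sampling and splitting procedure described above can be sketched as follows. This is
an illustration only: the file names are placeholders, and whether the 80/20 split is done
globally (as here) or per category is not stated in the paper.

```python
import random

CLASSES = ["addlane", "keepright", "laneend", "merge", "signalahead", "limit25",
           "limit30", "limit45", "stopahead", "stop", "turnright"]


def build_experiment_split(pool, per_category=200, train_fraction=0.8, seed=0):
    """Sample per_category items from each category, then split train/test."""
    rng = random.Random(seed)
    selected = []
    for category, items in pool.items():
        for item in rng.sample(items, per_category):
            selected.append((item, category))
    rng.shuffle(selected)  # global shuffle before the split (an assumption)
    cut = int(len(selected) * train_fraction)
    return selected[:cut], selected[cut:]


# Placeholder file names stand in for real images (hypothetical paths).
pool = {}
for name in CLASSES:
    pool[name] = [f"{name}_clean_{i}.png" for i in range(300)]
    pool[name + "-ad"] = [f"{name}_attacked_{i}.png" for i in range(300)]

train_set, test_set = build_experiment_split(pool)
print(len(train_set), len(test_set))
```

With 11 clean and 11 attacked categories at 200 images each, the experimental dataset holds
4,400 images before the split.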


For all experiments, we use TensorFlow 1.14 and a P4000 graphics card for training. YOLO
is trained with the Adam optimizer with a learning rate of 0.01 and a batch size of 32.
The adversarial patches are trained with SGD with a learning rate of 0.01 and a decay
rate of 0.1.


3.2 Object Detection 


Object Selection Performance Analysis of IYOLO-TS on Clean Dataset

To evaluate the performance of IYOLO-TS, we calculate the AP of YOLOv2 and IYOLO-TS for
each class on the LISA test set, as listed in Table 1. IYOLO-TS shows only a small
reduction in AP compared to YOLOv2. On average, the mAP of IYOLO-TS is 97.75%, which is
only 1.25 percentage points lower than that of YOLOv2, indicating that IYOLO-TS maintains
strong road sign detection performance.

Table 1. Performance of YOLOv2 and IYOLO-TS on the LISA test set

Classes   addlane  keepright  laneend  merge   signalahead  limit25
YOLOv2    100%     100%       92.39%   98.53%  100%         100%
IYOLO-TS  100%     100%       91.76%   96.49%  100%         98.63%

Classes   limit30  limit45  stopahead  stop  turnright  mAP
YOLOv2    100%     100%     100%       100%  100%       99.00%
IYOLO-TS  100%     100%     100%       100%  100%       97.75%


Analysis of the Validity of IYOLO-TS Defense Detection

To evaluate the defensive capability of IYOLO-TS, we calculated the AP of each class on
the dataset. IYOLO-TS can distinguish adversarial samples from clean data, and its mAP
reaches 98.12%. Table 2 shows the detection AP of IYOLO-TS for all classes of images,
which demonstrates strong defense detection performance. Figure 5 shows the performance of
IYOLO-TS and YOLOv2 against patch attacks. Compared to YOLOv2, IYOLO-TS achieves higher
metrics in all 10 classes of signs other than the signalahead class, which shows a
stronger defense against attacked data.

Fig. 5. Performance comparison of IYOLO-TS and YOLOv2 against patch attacks.


Figure 6 shows the defense effect on LISA-Mask. The attacked addlane sign successfully
tricks YOLOv2 into identifying it as the merge class, whereas IYOLO-TS correctly
identifies the attacked target.

Table 2. IYOLO-TS AP for 22 classes of images, where classes indicate clean traffic signs data
and classes-ad indicate attacked traffic signs data


Classes AP Classes-ad AP 
addlane 96.34% addlane-ad 100% 
keepright 100% keepright-ad 100% 
laneend 100% laneend-ad 100% 
merge 91.46% merge-ad 97.67% 
signalahead 92.5% signalahead-ad 93.51% 
limit25 100% limit25-ad 100% 
limit30 100% limit30-ad 100% 
limit45 95.74% limit45-ad 95.65% 
stopahead 100% stopahead-ad 100% 
stop 100% stop-ad 95.45% 
turnright 97.37% turnright-ad 100% 
mAP 98.12% 


Fig. 6. Performance of YOLOv2 and IYOLO-TS for detection of attacked added lane. 


In addition, IYOLO-TS adds 11 attacked classes to the structure of YOLOv2, and Fig. 7
shows the detection results for some of the attacked classes. IYOLO-TS is able not only
to correctly identify the attacked traffic sign, but also to distinguish whether the
traffic sign is under attack. This shows that IYOLO-TS has good detection ability for
different kinds of patch attacks.

Fig. 7. Detection results of partially attacked classes.
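
One way to read a 22-class prediction as both a sign type and an attack flag is shown
below. The index layout (11 clean classes followed by their 11 attacked counterparts) is an
assumption made for illustration; the paper does not specify the ordering, and the class
names are taken from Table 2.

```python
CLASSES = ["addlane", "keepright", "laneend", "merge", "signalahead", "limit25",
           "limit30", "limit45", "stopahead", "stop", "turnright"]
NUM_CLEAN = len(CLASSES)


def decode_class(index: int):
    """Map an output index in [0, 22) to (true class name, attacked flag).

    Assumed layout: indices 0-10 are clean classes, 11-21 their "-ad" twins.
    """
    if not 0 <= index < 2 * NUM_CLEAN:
        raise ValueError("class index out of range")
    return CLASSES[index % NUM_CLEAN], index >= NUM_CLEAN


print(decode_class(3))   # clean merge sign
print(decode_class(14))  # attacked merge sign, still resolved to "merge"
```

Under this layout the detector's single class decision answers both questions at once, so
no separate attack detector is needed at inference time.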


3.3 Analysis of the Effectiveness of Patch Attack Defense 


To evaluate the defensive capability of IYOLO-TS, we test it under white-box attacks and
physical world attacks respectively.


Defense Effectiveness Analysis under White-box Attacks

We continue with the LISA-Mask generation process, using RP2 to generate a patch dataset,
LISA-Mask_0, against IYOLO-TS. First, IYOLO-TS was trained on the LISA training set; then
images with patch attacks were generated on the LISA dataset using RP2 against the
trained IYOLO-TS to obtain the LISA-Mask_0 dataset. Finally, the generated patch dataset
LISA-Mask_0 was used to test the IYOLO-TS model. Table 3 shows the performance of
IYOLO-TS against white-box attacks.


Table 3. Detection effectiveness of IYOLO-TS against white-box attacks


Classes  addlane  keepright  laneend  merge   signalahead  limit25
AP       95.95%   100%       90.79%   90.24%  100%         100%

Classes  limit30  limit45  stopahead  stop  turnright  mAP
AP       94.37%   95.35%   98.61%     100%  100%       97.22%

As can be seen from Table 3, except for laneend and merge, which have an AP of about 90%,
the other classes have AP values higher than 94%, indicating that IYOLO-TS still shows a
strong defense capability in the face of new attacks.


Defense Effectiveness Analysis under Physical World Attacks 

To verify the usefulness of the model in this paper, the defensive performance of
IYOLO-TS in the physical world was tested. In the experiments, the generated adversarial
patches are printed and attached to traffic signs to compare the defense effectiveness of
YOLOv2 and IYOLO-TS. As shown in (a), (d), (g) and (j) of Fig. 8, YOLOv2 produces wrong
detections under the generated adversarial patches, while the results in (b), (c), (e),
(f), (h), (i), (k) and (l) show that IYOLO-TS can distinguish clean data from attacked
data under physical attacks.


Fig. 8. Physical world attack test sample.


4 Conclusion and Future Work 


In this paper, an improved defense model, IYOLO-TS, was proposed to improve the resistance
of traffic sign detection to attacks. First, masks under multi-scale and multi-constraint
conditions were built to simulate random, multi-type attacks in the physical world, and a
first test dataset, LISA-Mask, was constructed through annotation fusion. On this basis,
11 attacked classes were added to the YOLOv2 network structure, so that the model can
distinguish attacked samples from original samples while maintaining its detection
capability. In the experiments, we compared the detection performance of IYOLO-TS and
YOLOv2, and completed the performance test and analysis under white-box attacks and
physical world attacks respectively. Experimental results show that IYOLO-TS has a good
defense ability against adversarial patch attacks from the physical world. However, the
ways in which real road traffic signs can be obscured or damaged go far beyond what this
study can simulate at this stage. In addition, vehicle speed, weather, light and other
factors will directly affect the processing efficiency of the model. Therefore, in our
future work, optimizing the model to adapt to dynamic environments and achieving a more
accurate and interpretable detection method are important and interesting research
topics.


Abstract. With the continuous emergence of new network threats, turning passive defense
into active prediction has become a pressing problem, and the rise of Cyber Threat
Intelligence (CTI) technology provides a new idea. CTI technology can obtain all kinds of
network security threat intelligence in a timely and effective manner, helping security
personnel quickly identify attacks and make effective decisions in time. However, threat
intelligence contains a large amount of redundant information, and the related security
entities suffer from mixed Chinese and English text, fuzzy boundaries, and polysemy.
Identifying complex and valuable information in this data has therefore become a great
challenge. To address these problems, a named entity recognition model for network threat
intelligence based on BERT-BiLSTM-Self-Attention-CRF is proposed to identify complex
network threat intelligence entities in text. First, dynamic word vectors are obtained
through BERT to fully represent semantic information and resolve polysemy. The word
vectors are then used as the input of BiLSTM, which produces context feature vectors. The
output is fed into the self-attention mechanism to capture correlations within the data
or features, and finally the result is input into CRF for annotation. To verify the
effectiveness of the model, experiments are carried out on a constructed network threat
intelligence dataset. The results show that the model significantly improves threat
intelligence named entity recognition compared with several other classical models.


Keywords: Cybersecurity · Named entity recognition · BERT


1 Introduction 


With the acceleration of the world's digitization, the network environment is becoming
more and more complex. At the same time, network attacks tend to be industrialized, and
attack means are becoming more and more diversified. The traditional way of building
defense strategies and deploying products based on experience has difficulty detecting
[1], intercepting, analyzing and responding to emerging new, persistent, and advanced
threats in a timely and effective manner [2]. In this context, Cyber Threat Intelligence
(CTI) [3] technology came into being. As an important form of network security knowledge,
it can support the construction of a more active network security defense [4] mode. Based
on all-around intelligence perception and multi-dimensional fusion analysis, it can assess
the overall network security situation and reasonably predict threat trends, so as to
respond dynamically and accurately to network security threats. However, existing network
threat intelligence is also mixed with a large amount of invalid or interfering
information. How to obtain critical threat intelligence entity information (such as
organizations, software, and vulnerability numbers) from threat intelligence more
effectively has become the focus of current research. Applying named entity recognition
(NER) technology [4] to network threat intelligence can effectively solve the problem of
extracting important security entity information from unstructured threat intelligence
text. Automatically identifying network security entities in Internet information, such
as software, vulnerabilities, attack means, and related network terms, and classifying
them is an important step in constructing a knowledge graph for network security [5].


2 Related Work


Early NER tasks used rule-based and dictionary-based approaches, which achieved good
results when very comprehensive rules and dictionaries were formulated, but at great
cost, so machine learning methods were considered to improve the accuracy of NER. Mulwad
V et al. [6] identified potential vulnerability descriptions through an SVM classifier
and used the Wikitology knowledge base to identify vulnerabilities, threats, and attacks
in Web text. Since SVM cannot consider context information, Joshi A et al. [7] used a
CRF-based system to identify important entities and concepts related to network security
in a given text. Part-of-speech (POS) information can also be added to improve NER
performance: Weerawardhana S et al. [8] identified the key PAG parameters embedded in
vulnerability description text using machine learning and POS, including software name,
version, impact, attacker operation, and user operation. Their experiments on entity
recognition in the field of network security show that the POS method provides a viable
alternative to machine learning.

Although machine learning [9] has improved NER in network security, it requires network
security researchers to label security data, which is extremely costly. As a branch of
machine learning, deep learning has become increasingly popular in recent years, and some
researchers have applied it to named entity recognition for network threat intelligence.
Pingchuan Ma et al. [10] proposed a BiLSTM-CRF method to extract security-related
concepts and entities from unstructured text and evaluated the model on open-source data
using P, R, and F1-score, with good results. Wu H et al. [11] added a domain dictionary
matching correction method to BiLSTM-CRF, using BiLSTM to automatically capture context
features, CRF to learn label constraint rules, and an ontology domain dictionary for
matching correction. Qin Y et al. [12] added a feature template (FT) to BiLSTM-CRF to
extract local context features, and a CNN to extract character-level features of security
entities, such as malware and vulnerabilities with English names. Li T et al. [13]
proposed a neural network model based on self-attention to identify entities: on top of
the existing BiLSTM-CRF model, a self-attention mechanism was added to extract more
context information related to the current word in a sentence. Han Zhang et al. [14]
added a GAN to BiLSTM-Attention-CRF to obtain labeled data and address the lack of
labeled data in network security. P Evangelatos et al. [15] proposed using a transformer
to extract named entities from threat intelligence and verified its validity through
experiments on the threat intelligence dataset DNRTI [16].

However, named entities in network threat intelligence are polysemous. The word vectors
obtained by word2vec and GloVe are static and cannot solve this problem. At the same
time, BiLSTM alone cannot obtain enough information about the current word. Therefore,
this paper proposes a BERT-BiLSTM-CRF named entity recognition method that incorporates a
self-attention mechanism. The BERT (Bidirectional Encoder Representations from
Transformers) [17] pre-trained language model produces dynamic word vectors: it adjusts
the embedding of a word according to the semantics of its context, better expresses the
relationships between words and sentences, and resolves polysemy. In addition, the
self-attention mechanism pays more attention to the important words related to the target
entity in a sentence, which better captures the interdependence between the current word
and other words and extracts more context information related to the current word.


3 BERT-BiLSTM-Self-attention-CRF Model 


The BERT-BiLSTM-Self-attention-CRF model is divided into four parts: the BERT pre-trained
language model, the BiLSTM layer, the self-attention layer, and the CRF layer. The
unstructured text is converted into dynamic word vectors through BERT, and the word
vectors are used as input to BiLSTM. Context feature information is obtained from the
forward LSTM and the backward LSTM, and the self-attention mechanism then selectively
pays more attention to important information and assigns it higher weight. Finally, the
sequence is labeled in the BIO scheme through CRF. The model structure is shown in Fig. 1.

Fig. 1. BERT-BiLSTM-self-attention-CRF model architecture
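
In the BIO scheme, B- marks the first token of an entity, I- marks a continuation, and O
marks tokens outside any entity. The sketch below turns a tag sequence into entity spans;
the label names SW and VUL are hypothetical and are not the paper's tag set.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close the entity that was open
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)  # continue the open entity
        else:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Cacti", "has", "an", "authorization", "issue", "vulnerability"]
tags = ["B-SW", "O", "O", "B-VUL", "I-VUL", "I-VUL"]  # hypothetical label set
entities = extract_entities(tokens, tags)
print(entities)
```

A CRF output layer is useful here because it can learn that an I- tag should not follow O
or a B- tag of a different type, which a per-token classifier cannot enforce.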


3.1 BERT Model 


Language models are the most important part of named entity recognition: they transform
the input unstructured text into word vectors. Word2Vec [18] was originally used to obtain
word vector representations in research on named entity recognition for network threat
intelligence. Its core idea is to obtain a vectorized representation of a word from its
context, and it includes two models, Skip-gram and CBOW. The former predicts the
surrounding words from a given central word, and the latter predicts the central word
from the given context. In addition, the GloVe [19] method obtains word vectors using a
co-occurrence matrix, which considers both local and global information. However,
Word2Vec and GloVe both produce static word vectors: the representation of a word is the
same in different contexts. Complex network security texts contain polysemous words. To
solve this problem, this paper adopts the BERT pre-trained language model, which generates
dynamic word vector representations and thereby resolves polysemy.
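
The difference between Skip-gram and CBOW is easiest to see in the training pairs each one
builds from the same sentence. The following sketch only generates those pairs; it is not
a Word2Vec implementation, and the window size and function names are illustrative.

```python
def context_indices(length, i, window):
    """Indices within `window` positions of i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(length, i + window + 1)) if j != i]


def skipgram_pairs(tokens, window=1):
    """(center, context word) pairs: the center word predicts each neighbor."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in context_indices(len(tokens), i, window)]


def cbow_pairs(tokens, window=1):
    """(context words, center) pairs: the neighbors jointly predict the center."""
    return [([tokens[j] for j in context_indices(len(tokens), i, window)], tokens[i])
            for i in range(len(tokens))]


tokens = ["Cacti", "has", "a", "vulnerability"]
print(skipgram_pairs(tokens)[:3])
print(cbow_pairs(tokens)[1])
```

In both cases each word ends up with one fixed vector after training, which is why neither
model can separate the different senses of a polysemous term.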

BERT adopts the encoder part of the bidirectional Transformer and has two pre-training
tasks. The first task is the masked language model (Masked LM), which randomly replaces
15% of the words in the input text with [MASK] and then infers the masked words from the
context. The second task is next sentence prediction, which builds on the first task:
two sentences are randomly selected from the pre-training text and labeled
IsNext/NotNext, and the model predicts whether the second sentence follows the first.
Figure 2 shows the structure of the BERT model.
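The masking step of the first pre-training task can be illustrated with a minimal
sketch. The function name `mask_tokens`, the seed, and the whitespace tokenization are
assumptions made for illustration only; the original BERT recipe also sometimes keeps
the chosen token or swaps in a random one, which is omitted here.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the tokens with [MASK].

    Returns the masked token list and a dict {position: original token}
    that a model would be trained to recover. Hypothetical helper.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))  # mask at least one token
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets


tokens = "Cacti has an authorization issue vulnerability".split()
masked, targets = mask_tokens(tokens)
```

The model never sees the tokens stored in `targets`; it has to predict them from the
unmasked context on both sides.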


Fig. 2. BERT architecture 


The input representation of BERT consists of three parts: Token Embedding, Segment
Embedding, and Position Embedding. These three vectors are summed to form the final
input, feature extraction is performed in the encoder part of the bidirectional
Transformer, and a sequence vector with rich semantics is finally obtained. The input
representation is shown in Fig. 3.
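As a rough illustration of the summation, the sketch below adds a token, a segment, and
a position vector for each input position. The two-dimensional embedding tables and
their values are made up for the example and are not taken from any trained model.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def bert_input(tokens, segment_ids, token_emb, segment_emb, position_emb):
    """Sum token, segment, and position embeddings for each input position."""
    return [
        add_vectors(token_emb[tok], segment_emb[seg], position_emb[pos])
        for pos, (tok, seg) in enumerate(zip(tokens, segment_ids))
    ]


# Toy embedding tables (hypothetical values, dimension 2).
token_emb = {"[CLS]": [0.1, 0.0], "Cacti": [0.5, 0.2], "[SEP]": [0.0, 0.3]}
segment_emb = {0: [0.01, 0.01]}
position_emb = [[0.0, 0.0], [0.001, 0.002], [0.002, 0.004]]

vectors = bert_input(["[CLS]", "Cacti", "[SEP]"], [0, 0, 0],
                     token_emb, segment_emb, position_emb)
```

Each entry of `vectors` is what the first Transformer encoder layer would receive for
that position.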


Input: [CLS] Cacti has an authorization issue vulnerability [SEP]

Fig. 3. Input representation of BERT


3.2 BiLSTM Layer 


Traditional neural networks cannot memorize the input context or infer content from
earlier information. This paper uses LSTM to address this problem. The model has a
memory function and can better capture long-distance dependencies. Through training, it
learns which information should be forgotten and which should be remembered. Its
structure is shown in Fig. 4.


Fig. 4. LSTM structure 


Its structure is composed of a forget gate, a memory gate, and an output gate. It is
controlled by the unit status. The implementation of LSTM is denoted as follows:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3}$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4}$$

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$

where $x_t$ is the input vector, $f_t$ is the forget gate, $i_t$ is the memory gate,
$o_t$ is the output gate, $C_t$ is the unit status at time $t$, and $h_t$ is the hidden
state at time $t$.
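To make the gate computations concrete, the following sketch performs a single LSTM
time step following Eqs. (1)-(6). The parameter dictionary layout and the helper names
are assumptions for illustration; a real model would use a tensor library and trained
weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W·v + b for a row-major matrix W and vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], concat, params["b_f"])]         # Eq. (1)
    i = [sigmoid(z) for z in affine(params["W_i"], concat, params["b_i"])]         # Eq. (2)
    c_tilde = [math.tanh(z) for z in affine(params["W_c"], concat, params["b_c"])]  # Eq. (3)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]        # Eq. (4)
    o = [sigmoid(z) for z in affine(params["W_o"], concat, params["b_o"])]         # Eq. (5)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                                # Eq. (6)
    return h, c


# Hidden size 1, input size 1, all-zero weights: every gate evaluates to 0.5.
zero = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([1.0], [0.0], [2.0], zero)
```

With zero weights the forget gate halves the previous unit status, which makes the
result easy to check by hand.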

However, an LSTM cannot encode information from back to front. Adding a reverse LSTM
captures the following context, so the BiLSTM model can better capture bidirectional
semantics, as shown in Fig. 5.


Fig. 5. BiLSTM model structure


In this paper, the word vectors output by the BERT layer are used as the input of the
forward and reverse LSTMs to obtain the forward feature information $h_t$ and the
reverse feature information $h_t'$, and the two are then concatenated to obtain the
final hidden state $H_t$, as shown below:


$$H_t = [h_t, h_t'] \tag{7}$$


3.3 Self-attention Layer 


To better exploit the useful information in threat intelligence text, this paper adds a
self-attention mechanism after the BiLSTM. It captures the correlation between vectors,
gives higher weight to the important information in the feature vectors output by the
BiLSTM layer, and gives lower weight to other information. The self-attention mechanism
in this paper is calculated as follows.

First, the hidden state output by the BiLSTM layer is represented as $H_t$, and the
matrices Q, K, and V are obtained by linear mappings of $H_t$:

$$Q = H_t W^Q, \quad K = H_t W^K, \quad V = H_t W^V \tag{8}$$

where $W^Q$, $W^K$, and $W^V$ are parameters learned in the training process. The
attention is then computed with scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{9}$$

The factor $1/\sqrt{d_k}$ is used to prevent the dot products from becoming too large.
Finally, the scaled scores are normalized with the Softmax function and multiplied by V
to get the result.
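The scaled dot-product computation can be sketched in a few lines. The sketch assumes
that Q, K, and V have already been produced by the learned projections, and it uses
plain Python lists instead of a tensor library; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for row-major matrices Q, K, V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Identical keys give uniform weights, so the output is the mean of the values.
out = attention(Q=[[1.0, 0.0]], K=[[1.0, 1.0], [1.0, 1.0]], V=[[2.0, 0.0], [0.0, 2.0]])
```

When the keys differ, positions whose keys align with the query receive larger weights,
which is how the layer emphasizes the more relevant words.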


3.4 CRF Layer 


A conditional random field (CRF) is a conditional probability model used to find the
label sequence with maximum probability. In the threat intelligence NER task, BiLSTM is
good at processing long-distance text information, but cannot model the dependency
between adjacent tags. CRF obtains the best prediction sequence through the relationship
between adjacent tags, which makes up for this deficiency of BiLSTM. CRF ensures the
validity of the predicted tags by adding constraints to the final predictions. These
constraints are learned automatically by the CRF layer during training, and the Viterbi
algorithm is used to find the most likely tag sequence.

Given an input sentence $X = \{x_1, x_2, \ldots, x_n\}$ and a corresponding prediction
sequence $Y = \{y_1, y_2, \ldots, y_n\}$, the score of the prediction sequence Y is
calculated as follows:

$$S(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{10}$$

where A represents the transfer matrix of the labels and P represents the label scores.
The score is used to compute the probability of sequence Y as follows:

$$P(Y|X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{S(X, \tilde{Y})}} \tag{11}$$

where Y represents the correctly marked sequence and $Y_X$ represents the set of
candidate marked sequences. Taking the logarithm of both sides of the above formula
gives the likelihood function of the prediction sequence:

$$\ln P(Y|X) = S(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} e^{S(X, \tilde{Y})} \tag{12}$$

Finally, the tag sequence with the highest probability is calculated by the Viterbi
algorithm.
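The scoring and decoding steps can be sketched as follows. This is a simplified
illustration, not the paper's implementation: the start and end transitions are
omitted, and the three-tag scores are invented so that the result can be checked by
hand.

```python
def sequence_score(P, A, tags):
    """Emission scores P[i][tag] plus transition scores A[prev][next]."""
    emit = sum(P[i][t] for i, t in enumerate(tags))
    trans = sum(A[a][b] for a, b in zip(tags, tags[1:]))
    return emit + trans


def viterbi(P, A):
    """Return the highest-scoring tag path and its score."""
    n_tags = len(A)
    best = list(P[0])
    back = []
    for i in range(1, len(P)):
        prev, best, pointers = best, [], []
        for t in range(n_tags):
            cand = [prev[s] + A[s][t] for s in range(n_tags)]
            s_star = max(range(n_tags), key=lambda s: cand[s])
            best.append(cand[s_star] + P[i][t])
            pointers.append(s_star)
        back.append(pointers)
    last = max(range(n_tags), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers to the first position
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical scores; tag ids: 0 = O, 1 = B, 2 = I. The O -> I transition is penalized.
P = [[0.1, 2.0, 0.0], [0.2, 0.0, 1.5], [1.0, 0.0, 0.9]]
A = [[0.0, 0.0, -10.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.5]]
path, score = viterbi(P, A)
```

The large negative O -> I transition plays the role of a learned constraint: an I tag
can only follow a B or another I.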


4 Experimental Analysis 


4.1 Dataset Construction 


Since there is no public Chinese named entity recognition dataset in network security,
this paper crawls the required data with Python from websites related to network
security vulnerabilities, such as the National Information Security Vulnerability
Sharing Platform (www.cnvd.org.cn), the Information Security Vulnerability Portal
(http://cve.scap.org.cn), the 360 Network Security Response Center (cert.360.cn), and
the National Internet Emergency Center (www.cert.org.cn). The entities are divided into
nine types (as shown in Table 1) and labeled with BIO. B represents the first word of
the entity, I represents the intermediate word of the entity, and O represents the
non-entity.
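A small sketch shows how entity spans map to BIO tags. The sentence and its entity
spans are a hypothetical annotation written for this example; they are not taken from
the paper's dataset.

```python
def bio_tags(tokens, entities):
    """Convert (start, end_exclusive, entity_type) token spans to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # first word of the entity
        for k in range(start + 1, end):
            tags[k] = f"I-{etype}"          # remaining words of the entity
    return tags


tokens = ["Cacti", "has", "an", "authorization", "issue", "vulnerability"]
tags = bio_tags(tokens, [(0, 1, "software"), (3, 6, "vul_name")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```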


Table 1. Entity labeled mode


Entity type    BIO mode
person         B-person/I-person
org            B-org/I-org
changjia       B-changjia/I-changjia
software       B-software/I-software
cve_id         B-cve_id/I-cve_id
cnvd_id        B-cnvd_id/I-cnvd_id
vul_name       B-vul_name/I-vul_name
date           B-date/I-date
term           B-term/I-term
other          B-other/I-other
non-entity     O


The labeled dataset is divided into a training set, a test set, and a validation (dev)
set at a ratio of 7:2:1 (as shown in Table 2).
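One possible way to produce such a split is sketched below. The shuffling, the seed,
and the function name are assumptions; the paper does not say how its split was drawn.

```python
import random


def split_dataset(samples, ratios=(7, 2, 1), seed=42):
    """Shuffle and split samples into train/test/dev parts by the given ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    dev = items[n_train + n_test:]          # the remainder goes to the dev set
    return train, test, dev


train, test, dev = split_dataset(range(100))
```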


Table 2. Dataset size


Type        Train   Test   Dev
person       1285    395   158
org           268     83    26
changjia      691    221    84
software     5639   1706   712
cve_id       1906    467   190
cnvd_id      1503    595   219
vul_name     3115    954   386
date         1250    484   200
term          734    191    65
other        1695    429   135


4.2 Evaluation Metrics 


Evaluating performance is a crucial step in the NER task. Through evaluation, we can
analyze the advantages and remaining problems of the proposed algorithm. At present,
three main indicators are used to measure the performance of NER tasks: Precision,
Recall, and F1-score.

Precision is the proportion of the samples predicted to be positive that are actually
positive. The formula is as follows:

$$P = \frac{TP}{TP + FP} \times 100\% \tag{13}$$

Recall is the proportion of the actually positive samples that are predicted to be
positive. The formula is as follows:

$$R = \frac{TP}{TP + FN} \times 100\% \tag{14}$$


Precision and recall tend to trade off against each other, so it is difficult to
maximize both at the same time. The F1-score is therefore used as a comprehensive index
that balances the impact of precision and recall. Its formula is as follows:

$$F1 = \frac{2 \times P \times R}{P + R} \times 100\% \tag{15}$$

where TP refers to the number of samples that are actually positive and predicted to be
positive, FP refers to the number of samples that are actually negative and predicted to
be positive, and FN refers to the number of samples that are actually positive and
predicted to be negative.
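The three metrics can be computed directly from the counts. The sketch below is a
minimal illustration with made-up counts; it returns percentages and guards against
division by zero, which the formulas leave implicit.

```python
def precision_recall_f1(tp, fp, fn):
    """Return precision, recall, and F1 as percentages."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p * 100, r * 100, f1 * 100


# Hypothetical counts: 90 correct entities, 10 spurious ones, 30 missed ones.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these counts precision is 90%, recall is 75%, and F1 lies between the two, closer
to the lower value.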


4.3 Experimental Results 


Experiments are carried out on the constructed network security dataset. To verify the
rationality of the proposed model, it is compared with several classical models for the
named entity recognition task. The comparison results are shown in Table 3.


Table 3. Comparison of different models (%)


Models                             P       R       F1
HMM                                85.78   80.97   81.14
BiLSTM                             88.07   81.78   83.17
BiLSTM-CRF                         89.22   83.67   85.37
BERT-BiLSTM-CRF                    90.21   91.90   91.04
BERT-BiLSTM-Self-attention-CRF     92.45   94.20   93.32


For the task of named entity recognition in network security, more features are needed
for recognition, and the state at the current time should be related to the states at
both the previous time and the next time, whereas the current state in HMM is only
related to the previous state. The experimental results show that the F1 value of BiLSTM
is higher than that of HMM. BiLSTM alone cannot learn the relationship between state
sequences; after adding CRF, it can. Comparing the BiLSTM-CRF model with
BERT-BiLSTM-CRF, the experimental results show that the F1-score is significantly
improved, because BERT can deeply extract the semantic information of network security
text and fully reflect the polysemy of a word.

Comparing the BERT-BiLSTM-CRF model with the BERT-BiLSTM-Self-attention-CRF model
proposed in this paper, the precision, recall, and F1-score are all improved. With the
self-attention mechanism, the model is better at capturing the correlation between the
data in the full text of network security by calculating the interaction between words,
so the F1-score of the proposed model is 2.28 percentage points higher than that of the
BERT-BiLSTM-CRF model. It achieves good results in the task of network security named
entity recognition.


5 Conclusion and Future Work 


Threat intelligence has gradually become one of the hot areas of network security. At
present, government departments and network security enterprises pay increasing
attention to the development of threat intelligence, and the demand for threat
intelligence in all walks of life is growing. However, network threat intelligence
entities pose some problems, such as ambiguous words, mixed Chinese and English, and
blurred boundaries. To solve these problems, this paper presents a network security
named entity recognition model based on BERT-BiLSTM-CRF combined with a self-attention
mechanism. The model uses the BERT pre-trained language model to generate word vectors
dynamically through the bidirectional Transformer structure, mining syntactic structure
and semantic information, and introduces a self-attention mechanism to calculate the
correlation between words. Long-distance dependence is handled better by assigning
different weights to different words according to their degree of association.
Experiments show that the model improves P, R, and F1-score and has a good recognition
effect. It can complete practical network threat intelligence entity identification and
alleviates the difficulties of threat intelligence entity identification and the
ambiguity of polysemous words.

However, there is still much room to improve named entity recognition for network threat
intelligence. Because there is still a large amount of unlabeled network security
corpora in specific areas, transfer learning can be considered in future research to
address the lack of labeled data. The performance of identifying network threat
intelligence entities can be further improved by expanding the size of the
corpus.


Abstract. Accurate identification of Internet buzzwords plays an important role
in positive Internet opinion guidance. A Transformer-based Internet buzzword
feature recognition system is designed to address this problem. The traditional
way of crawling data is improved by adding a real-time crawling module, and a
self-built Internet buzzword corpus is constructed. Traditional machine learning
models suffer from gradient disappearance and gradient explosion. The Transformer
model, with its parallel computing and self-attention mechanism, is a good
solution to these problems, and its bi-directional connection allows the
parameters of the context to be updated uniformly, thus allowing better
aggregation of information and solving the problem of scattered contextual
information. The position encoding of the Transformer model is replaced with a
relative position representation (RPR), which compensates for the model's
inability to obtain relative position information. The experimental results show
that the improved Transformer model achieves a precision of 90.1%, a recall of
92.13%, and an F1 value of 91.16% in recognizing Internet buzzwords.


Keywords: Internet buzzwords · Transformer model · Relative position
representation (RPR)


1 Introduction 


With an Internet penetration rate of 73.0% as of December 2021 [1], the Internet has
become an essential part of people’s lives. The Internet has given the public more
channels to express their ideas, and Internet buzzwords are a concentrated product of
that expression. However, there are both positive and negative Internet buzzwords: while
they express the ideas of Internet users, they may also steer public opinion in a
negative direction. Therefore, accurate identification of Internet buzzwords plays an
important role in guiding Internet opinion correctly.

The system applies deep learning techniques to recognize Internet buzzwords. Deep
learning techniques extract, transform, and combine features from the initial text to
obtain a set of feature representations, which are then fed into a prediction function
to obtain the recognition results [2]. Deep learning models are built around three
functional components: the embedding layer, the encoding layer, and the output layer.
The embedding layer converts words into feature vectors, the encoding layer obtains
textual contextual features, and the output layer learns the rules between sequences and
classifies the output [3]. Although RNN structures are widely used to process
sequence-like time-stream data [4-6], they suffer from structural problems such as
serial computation, gradient disappearance [7], and one-way construction. The
contributions of applying the Transformer model to Internet buzzword feature recognition
are as follows:

(1) A real-time crawling module is added to the data crawling, which obtains Internet
buzzword data more accurately and alleviates the problem that traditionally crawled data
is updated too slowly.

(2) Existing Internet buzzword datasets are scattered and sparse, so the data collected
through web crawling is used to build a dynamic, self-built Internet buzzword corpus.

(3) Traditional machine learning models suffer from gradient disappearance and gradient
explosion. The Transformer model, with its parallel computing and self-attention
mechanism, solves these problems, and its bi-directional connection allows the
parameters of the context to be updated uniformly, thus enabling better information
aggregation and solving the problem of information dispersion in the context.

(4) The position encoding of the Transformer model is improved by converting the
position encoding vector into a relative position representation (RPR) [8], which
compensates for the model's inability to obtain relative position information.


2 Related Work 


The existing literature on the identification of Internet buzzwords and Internet
neologisms falls into three types: rule-based methods, statistics-based methods, and
methods that combine statistics and rules.

The rule-based approach focuses on developing rules that capture common features among
words, based on linguistic theory and knowledge, or on observing the rules and patterns
of word formation through long-term study of the language and then summarizing their
properties and combining them with grammar. Because the core of the rule-based approach
to new word discovery is the construction of a knowledge base for the domain, a
specialized rule base needs to be created, and during online buzzword identification new
words are discovered according to how closely they match the rules in that rule base.
The statistics-based approach addresses a drawback of the rule-based approach, which
relies on extensive manual annotation, and thus saves significant time and labor costs.
Even though the statistics-based approach makes up for many of the shortcomings of the
rule-based approach, experiments in the literature have shown that it has a low
recognition rate, while a fusion of the two can improve the recognition rate of Internet
buzzwords. The literature [9] proposes a kth-order PMI algorithm; experiments show that
its accuracy is improved by about 28.79% over PMI, and that when the parameter k is
greater than or equal to 3 it can overcome the defects of the PMI method. The
Transformer model is also based on a combination of statistical and rule-based methods
and has been applied to Internet buzzwords to improve recognition rates.


3 Overall System Architecture 


The Transformer deep learning model is applied to identify the features of Internet
buzzwords, and the overall process of the system is shown in Fig. 1.


Fig. 1. Overall system flow chart


First, the user logs in to the Internet buzzword recognition system and enters the text
to be analyzed on the text analysis page. The Internet buzzword database in the
background then checks the entered text. If the text is an Internet buzzword in the
corpus, it is directly identified as an Internet buzzword; if it does not exist in the
Internet buzzword database, it is passed to the Transformer model to determine whether
it is an Internet buzzword.
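The lookup-then-model flow can be sketched as below. Everything in the sketch is
hypothetical: the function names, the 0.5 threshold, and the stand-in scoring function
are placeholders for the system's database query and trained Transformer.

```python
def recognize(text, corpus, model_predict, threshold=0.5):
    """Return (is_buzzword, source), checking the corpus before the model."""
    if text in corpus:                      # exact hit in the buzzword corpus
        return True, "corpus"
    score = model_predict(text)             # fall back to the model's probability
    return score >= threshold, "model"


# Stand-ins for the real corpus and the trained model (assumptions for illustration).
corpus = {"known buzzword"}


def fake_model(text):
    return 0.9 if "new" in text else 0.1


hit = recognize("known buzzword", corpus, fake_model)
fallback = recognize("brand new phrase", corpus, fake_model)
```

Checking the corpus first avoids a model call for words that are already known.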

The Transformer-based Internet buzzword recognition solution is implemented in the
following steps:

Step 1: Crawl the existing Internet buzzword corpus on Weibo. To achieve real-time
incremental crawling on top of the original crawler technology, each URL is marked with
an identifier, the data fingerprint. The data fingerprint is set to a hash value, and
comparing hash values is then enough to determine whether the crawled content needs to
be updated.

Step 2: Pre-process the crawled Internet buzzwords by first de-duplicating the data,
then segmenting the longer phrases into words using search engine mode, and finally
filtering stop words using Baidu’s stop word list.

Step 3: Use the matplotlib, jieba, and wordcloud libraries to visualize the processed
Internet buzzwords and draw their word cloud.

Step 4: Convert the pre-processed data into text vectors with the Skip-gram method of
the word2vec model.

Step 5: Apply position encoding to the feature vectors obtained in the previous step: a
position vector representing position information is combined with the word embedding to
obtain the final vector with position information.

Step 6: Input the vector with position information into the Transformer model and
determine whether the input is an Internet buzzword.


4 System Implementation 


4.1 Data Processing
4.1.1 Data Acquisition 


Real-time incremental crawling of Internet buzzwords is done by tagging URLs with a data
fingerprint identifier. The data fingerprint is set to a hash value, that is, a
fixed-length string generated from the input; the hash values are then compared to
determine whether the crawl needs to be updated. Two set operations of the Redis
database are used (their behavior matches the Redis SADD and SISMEMBER commands): the
former inserts a piece of data into the set, returning 1 for success and 0 for failure;
the latter queries whether an element exists in the set, returning 1 for existence and 0
for non-existence. As shown in Fig. 2, when the Spider module receives a URL to process,
a Spider middleware determines whether the fingerprint of the URL exists in the Redis
database; if so, the URL is discarded; if not, the new URL is fetched and crawled.
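The fingerprint check can be sketched without a running Redis server. In this
illustration a Python set stands in for the Redis set, SHA-1 is an assumed choice of
hash function, and the class and method names are hypothetical.

```python
import hashlib


class FingerprintFilter:
    """In-memory stand-in for the Redis-backed URL fingerprint set."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(url):
        # Fixed-length hexadecimal digest used as the data fingerprint.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def add(self, url):
        """Insert the fingerprint; return 1 if it was new, 0 otherwise."""
        fp = self.fingerprint(url)
        if fp in self._seen:
            return 0
        self._seen.add(fp)
        return 1

    def exists(self, url):
        """Return 1 if the fingerprint is already stored, else 0."""
        return 1 if self.fingerprint(url) in self._seen else 0

    def should_crawl(self, url):
        """Crawl only URLs whose fingerprint has not been seen before."""
        return self.add(url) == 1


url_filter = FingerprintFilter()
first_visit = url_filter.should_crawl("https://example.com/post/1")
second_visit = url_filter.should_crawl("https://example.com/post/1")
```

The second request for the same URL is rejected, which is the "abandon this request"
branch of the crawling flow.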


Fig. 2. Real-time web crawling flow chart


4.1.2 Data Pre-processing


Crawling with the keyword “Internet buzzwords” yielded tens of thousands of
high-frequency Internet buzzwords. First, these buzzwords were de-duplicated. The
duplicated() function of pandas, a data analysis tool in Python, is applied to detect
duplicate data: by default, each repeated row after its first occurrence is marked True,
and the rows marked True are then removed by applying the drop_duplicates() function.

The next step is to apply jieba, a third-party Chinese word segmentation library for
Python, to the longer phrases in the crawled Internet buzzwords. Given the granularity
of Internet buzzwords, the search engine mode, which further splits long words on top of
the accurate mode, is used for word segmentation. The commands for cutting long words
are jieba.cut_for_search() and jieba.lcut_for_search().

The next step is to filter the crawled data for English characters, numbers,
mathematical characters, punctuation marks, very frequently used single Chinese
characters, modal particles, adverbs, prepositions, conjunctions, etc. This paper uses
the Baidu stop word list as the filter.
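A minimal sketch of this filtering step is given below. The tiny stop-word set and the
sample tokens are illustrative stand-ins; the real system would load the full Baidu
stop-word list from a file.

```python
import re

# Tokens made up only of ASCII letters, digits, underscores, or non-word
# characters (punctuation and symbols, including full-width punctuation).
NOISE = re.compile(r"[A-Za-z0-9_\W]+")


def filter_tokens(tokens, stop_words):
    """Drop stop words and tokens that consist only of noise characters."""
    kept = []
    for token in tokens:
        if token in stop_words:
            continue
        if NOISE.fullmatch(token):
            continue
        kept.append(token)
    return kept


stop_words = {"的", "了", "和"}           # tiny stand-in for the stop-word list
tokens = ["绝绝子", "的", "abc", "2022", "!!", "内卷", "，"]
cleaned = filter_tokens(tokens, stop_words)
```

Frequent single characters and function words are removed through the stop-word list
itself, so no separate length rule is needed in this sketch.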


4.1.3 Constructing an Online Buzzword Feature Vector 


The pre-processed data is transformed into character vectors using the Word2vec model.
The Word2vec module is called from the Gensim package and provides two methods for
vectorizing text, CBOW and Skip-gram. In the training process, sg = 1 is set so that the
Skip-gram algorithm is used for training. The sliding window size is set to 5, the word
vector dimension (size) is set to 100, and min_count is used for filtering: words with a
frequency lower than the set value, which is 5 in this paper, are discarded. Skip-gram
predicts the surrounding words from the central word: each central word has K words as
output, so there are K predictions for a word, for a total of K * V.

The model training process is as follows: (1) Use center_words V to query W0 and
target_words T to query W1, obtaining two tensors of shape [batch_size, embedding_size],
denoted H1 and H2, respectively. (2) Compute the dot product of the two tensors. (3)
Apply a sigmoid function to the dot product from (2) to map it to a value between 0 and
1 as the predicted probability; the model can then be trained based on the label
information L. After training, W0 is generally used as the final word vector table. The
similarity between different words can be calculated with the vector dot product.
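The three training steps can be sketched for a single (center word, target word) pair.
The sketch assumes a binary cross-entropy loss and plain gradient descent, neither of
which is stated in the text, and it uses toy two-dimensional embeddings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(W0, W1, center, target):
    """Steps (1)-(3): look up both vectors, take the dot product, apply sigmoid."""
    h1, h2 = W0[center], W1[target]
    return sigmoid(sum(a * b for a, b in zip(h1, h2)))


def sgd_step(W0, W1, center, target, label, lr=0.1):
    """One gradient step on binary cross-entropy (assumed loss) for one pair."""
    p = predict(W0, W1, center, target)
    grad = p - label                         # derivative w.r.t. the dot product
    h1, h2 = list(W0[center]), list(W1[target])
    W0[center] = [a - lr * grad * b for a, b in zip(h1, h2)]
    W1[target] = [b - lr * grad * a for a, b in zip(h1, h2)]
    return p


# Toy embedding tables with one word each (hypothetical values).
W0 = [[0.1, 0.2]]
W1 = [[0.3, -0.1]]
before = predict(W0, W1, 0, 0)
sgd_step(W0, W1, 0, 0, label=1)
after = predict(W0, W1, 0, 0)
```

After a step with label 1, the predicted probability for the pair increases, which is
the behavior the training loop relies on.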


4.2 Transformer Model 


The Transformer model was proposed by Vaswani et al. in their paper “Attention Is All
You Need” [10], published in late 2017, and its general structure is shown in Fig. 3.


Fig. 3. Transformer structure diagram


Consider the position encoding in the traditional Transformer model. Because the
Transformer has no iterative operation like that of a recurrent neural network, it has
no access to relative position information, so the position information of each word
must be provided to it. The position encoding in the Encoder and Decoder parts of the
Transformer model is therefore transformed into a relative position representation
(RPR), compensating for the model's inability to obtain relative position information.

Two sets of position encoding vectors need to be learned, one for computing $z_i$ and
one for computing $e_{ij}$. If the clipping distance is k, then there are 2k + 1
relative position encoding vectors to learn, of which k are to the left of the current
word, k are to its right, and one belongs to the word itself. The traditional
Transformer does not use relative positional encoding when calculating, after SoftMax,
the degree of attention that word i pays to word j. Comparing the two calculation
methods, the RPR calculation represents position more accurately, so the model uses RPR
for the position encoding of both the Encoder and Decoder parts.
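The count of 2k + 1 vectors follows from clipping the relative distance j - i to the
range [-k, k]. The sketch below builds the index matrix that selects which relative
position vector is used for each pair of words; the function names are illustrative.

```python
def relative_index(i, j, k):
    """Map the clipped distance j - i to an index in 0..2k."""
    return max(-k, min(k, j - i)) + k


def relative_index_matrix(length, k):
    """Index of the relative position vector for every (i, j) pair."""
    return [[relative_index(i, j, k) for j in range(length)] for i in range(length)]


matrix = relative_index_matrix(length=5, k=2)
for row in matrix:
    print(row)
```

For a sequence of length 5 and k = 2, the first row is [2, 3, 4, 4, 4]: the word itself
maps to index k, and every word more than k positions to the right shares index 2k.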


5 Analysis and Visualization of Experimental Results 


5.1 Experimental Parameters 


The number of layers is set to 2, and the batch size is set to 128. BIDIRECTIONAL is set
to True to analyze the sequence from front to back and from back to front. Table 1 lists
the parameters of the Transformer model and their corresponding optimal values.


Table 1. Transformer model parameters.


Parameters          Value
BATCH_SIZE          128
HIDDEN_DIM          768
OUTPUT_DIM          1
N_LAYERS            2
BIDIRECTIONAL       True
DROPOUT             0.25
SEED                1234
MAX_INPUT_LENGTH    80
N_EPOCHS            5


5.2 Comparative Experiments 


The recognition of Internet buzzwords is evaluated using precision (Pre), recall (Rec),
and F1. To verify the performance of the Transformer model proposed in this paper, the
feature vectors of Internet buzzwords are used as the input vectors of the model, and
the results of comparison experiments against the commonly used single models CRF, LSTM,
BILSTM, and CNN [12] are shown in Table 2.


Table 2. Recognition performance of the models.


Model          Precision P (%)   Recall R (%)   F1 (%)
CRF            76.6              85.85          80.96
LSTM           20.83             24.01          22.3
BILSTM         87.46             87.6           87.53
CNN            63.47             67.41          65.38
TRANSFORMER    90.1              92.13          91.16


The experimental results show that the Transformer-based Internet buzzword recognition
model outperforms the common single models CRF, LSTM, BILSTM, and CNN. The LSTM model
has the lowest recognition rate for irregular words such as Internet buzzwords because
it can only extract information from the preceding context, not the following context;
its F1 value is only 22.3%, which is ineffective for recognizing Internet buzzwords. The
CNN model has an F1 value of 65.38%, a middling performance compared to the other
models. The BILSTM model has an F1 value of 87.53%, which is 6.57 percentage points
higher than the CRF model and is a relatively good performance. The Transformer model
performs best in precision, recall, and F1, with 90.1%, 92.13%, and 91.16%
respectively.

The precision P of each model is shown in Fig. 4, the recall R in Fig. 5, and the F1
value in Fig. 6.


Fig. 4. Comparison of precision P (%) across models


Fig. 5. Comparison of recall R (%) across models


Fig. 6. Comparison of F1 (%) across models


The line graphs of the experimental results show that the Transformer-based model has
the highest precision, recall, and F1 score, with its curve at the top, at 90.1%,
92.13%, and 91.16% respectively. The experimental data show that the model in this paper
improves the recognition rate of Internet buzzwords.


5.3 Visualization of Internet Buzzword Recognition


The Internet buzzword recognition system implements a visual interface based on Flask, a
lightweight web framework for Python. After logging in, users can see options for data
queries, real-time analysis, and hot topics in the sidebar of the home page. The data
query page is shown in Fig. 7: it contains all data, Internet buzzwords, and
non-Internet buzzwords, and allows users to view information such as user name, posting
time and content, device information, number of likes, retweets, and comments, and
whether the data is an Internet buzzword.


Fig. 7. Visualization of data enquiry pages


The real-time analysis page is shown in Fig. 8. The words to be classified are entered
in the content input field, the predicted probability score is displayed in the
sentiment score field, and whether they are suspected to be Internet buzzwords is
displayed in the sentiment evaluation column.


Fig. 8. Example of real-time analysis page visualization


6 Conclusion


To improve the recognition rate of Internet buzzwords, Transformer-based Internet
buzzword feature recognition is proposed. A real-time crawling module has been added to
the data crawling, which obtains Internet buzzword data more accurately and alleviates
the problem that traditionally crawled data is updated too slowly. As buzzword datasets
on the web are scattered and sparse, a dynamic corpus of Internet buzzwords is
constructed in-house from data collected through web crawling. Traditional machine
learning models suffer from gradient disappearance and gradient explosion. The
Transformer model, with its parallel computing and self-attention mechanism, solves
these problems, and its bi-directional connection allows the parameters of the context
to be updated uniformly, thus enabling better information aggregation and solving the
problem of information dispersion in the context. The position encoding of the
Transformer model is improved by converting the position encoding vector to a relative
position representation (RPR), which compensates for the model's inability to obtain
relative position information.